Performance of Large Language Models on a Neurology Board–Style Examination

Published: 2023-12-07 Issue: 12 Volume: 6 Page: e2346721
ISSN: 2574-3805
Container-title: JAMA Network Open
Language: en
Short-container-title: JAMA Netw Open

Author:

Schubert Marc Cicero¹², Wick Wolfgang¹², Venkataramani Varun¹²

Affiliation:

1. Neurology Clinic and National Center for Tumor Diseases, University Hospital Heidelberg, Heidelberg, Germany

2. Clinical Cooperation Unit Neurooncology, German Cancer Consortium, German Cancer Research Center, Heidelberg, Germany

Abstract

Importance: Recent advancements in large language models (LLMs) have shown potential in a wide array of applications, including health care. While LLMs showed heterogeneous results across specialized medical board examinations, the performance of these models in neurology board examinations remains unexplored.

Objective: To assess the performance of LLMs on neurology board–style examinations.

Design, Setting, and Participants: This cross-sectional study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank resembling neurology board–style examination questions and was validated with a small question cohort by the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions based on the Bloom taxonomy for learning and assessment. Performance by LLM ChatGPT versions 3.5 (LLM 1) and 4 (LLM 2) was assessed in relation to overall scores, question type, and topics, along with the confidence level and reproducibility of answers.

Main Outcomes and Measures: Overall percentage scores of 2 LLMs.

Results: LLM 2 significantly outperformed LLM 1 by correctly answering 1662 of 1956 questions (85.0%) vs 1306 questions (66.8%) for LLM 1. Notably, LLM 2’s performance was greater than the mean human score of 73.8%, effectively achieving near-passing and passing grades in the neurology board–style examination. LLM 2 outperformed human users in behavioral, cognitive, and psychological–related questions and demonstrated superior performance to LLM 1 in 6 categories. Both LLMs performed better on lower-order than higher-order questions, with LLM 2 excelling in both lower-order and higher-order questions. Both models consistently used confident language, even when providing incorrect answers. Reproducible answers of both LLMs were associated with a higher percentage of correct answers than inconsistent answers.

Conclusions and Relevance: Despite the absence of neurology-specific training, LLM 2 demonstrated commendable performance, whereas LLM 1 performed slightly below the human average. While higher-order cognitive tasks were more challenging for both models, LLM 2’s results were equivalent to passing grades in specialized neurology examinations. These findings suggest that LLMs could have significant applications in clinical neurology and health care with further refinements.
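The overall percentage scores in the Results can be checked directly from the reported counts. The following is a minimal sketch, not the authors' analysis code; the function and variable names are illustrative assumptions, and only the counts (1662, 1306, 1956) and the mean human score (73.8%) come from the abstract.

```python
# Recompute the overall scores reported in the abstract from the raw counts.
# Names below are hypothetical; only the numbers come from the abstract.

TOTAL_QUESTIONS = 1956
CORRECT = {"LLM 1": 1306, "LLM 2": 1662}
HUMAN_MEAN_PERCENT = 73.8


def percent_correct(correct: int, total: int) -> float:
    """Return the share of correct answers as a percentage."""
    return 100.0 * correct / total


def summarize() -> dict:
    """Map each model to (score, difference from the human mean), rounded to 1 decimal."""
    summary = {}
    for model, correct in CORRECT.items():
        score = percent_correct(correct, TOTAL_QUESTIONS)
        summary[model] = (round(score, 1), round(score - HUMAN_MEAN_PERCENT, 1))
    return summary


if __name__ == "__main__":
    for model, (score, diff) in summarize().items():
        print(f"{model}: {score}% ({diff:+} percentage points vs human mean)")
```

The rounded scores match the 66.8% and 85.0% stated in the Results; the difference column is a simple subtraction against the reported human mean and is not a statistical test.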

Publisher

American Medical Association (AMA)

Subject

General Medicine

Link

https://jamanetwork.com/journals/jamanetworkopen/articlepdf/2812620/schubert_2023_oi_231362_1705951618.38606.pdf

Cited by 21 articles.

1. Large Language Models in Biomedical and Health Informatics: A Review with Bibliometric Analysis;Journal of Healthcare Informatics Research;2024-09-14

2. Large language models in psychiatry: Opportunities and challenges;Psychiatry Research;2024-09

3. Supercharge Your Academic Productivity with Generative Artificial Intelligence;Journal of Medical Systems;2024-08-08

4. Besteht ChatGPT die neurologische Facharztprüfung? Eine kritische Betrachtung;psychopraxis. neuropraxis;2024-08-01

5. Large language models for accurate disease detection in electronic health records;2024-07-29