Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations

Published:2023-08-15 Issue:6 Volume:93 Page:1353-1365
ISSN:0148-396X
Container-title:Neurosurgery
language:en

Author:

Ali Rohaid¹, Tang Oliver Y.¹, Connolly Ian D.², Zadnik Sullivan Patricia L.¹, Shin John H.³, Fridley Jared S.¹, Asaad Wael F.¹³⁴⁵, Cielo Deus¹, Oyelese Adetokunbo A.¹, Doberstein Curtis E.¹, Gokaslan Ziya L.¹, Telfeian Albert E.¹

Affiliation:

1. Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA;

2. Department of Neurosurgery, Massachusetts General Hospital, Boston, Massachusetts, USA;

3. Department of Neuroscience, Norman Prince Neurosciences Institute, Rhode Island Hospital, Providence, Rhode Island, USA;

4. Department of Neuroscience, Brown University, Providence, Rhode Island, USA;

5. Department of Neuroscience, Carney Institute for Brain Science, Brown University, Providence, Rhode Island, USA

Abstract

BACKGROUND AND OBJECTIVES: Interest surrounding generative large language models (LLMs) has grown rapidly. Although ChatGPT (GPT-3.5), a general LLM, has shown near-passing performance on medical student board examinations, the performance of ChatGPT and its successor GPT-4 on specialized examinations, and the factors affecting their accuracy, remain unclear. This study aims to assess the performance of ChatGPT and GPT-4 on a 500-question mock neurosurgical written board examination.

METHODS: The Self-Assessment Neurosurgery Examinations (SANS) American Board of Neurological Surgery Self-Assessment Examination 1 was used to evaluate ChatGPT and GPT-4. Questions were in single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests were used to assess performance differences in relation to question characteristics.

RESULTS: ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% CI: 69.3%-77.2%) and 83.4% (95% CI: 79.8%-86.5%), respectively, relative to the user average of 72.8% (95% CI: 68.6%-76.6%). Both LLMs exceeded last year's passing threshold of 69%. Although the scores of ChatGPT and question bank users were equivalent (P = .963), GPT-4 outperformed both (both P < .001). GPT-4 correctly answered every question that ChatGPT answered correctly, as well as 37.6% (50/133) of the questions that ChatGPT answered incorrectly. Among 12 question categories, GPT-4 significantly outperformed users in each but performed comparably with ChatGPT in 3 (functional, other general, and spine) and outperformed both users and ChatGPT for tumor questions. Increased word count (odds ratio = 0.89 of answering a question correctly per +10 words) and higher-order problem-solving (odds ratio = 0.40, P = .009) were associated with lower accuracy for ChatGPT, but not for GPT-4 (both P > .05). Multimodal input was not available at the time of this study; hence, on questions with image content, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly based on contextual clues alone.

CONCLUSION: LLMs achieved passing scores on a mock 500-question neurosurgical written board examination, with GPT-4 significantly outperforming ChatGPT.
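The headline figures in the RESULTS can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it recovers the correct-answer counts from the reported percentages, recomputes approximate 95% confidence intervals, and rescales the per-10-word odds ratio. The abstract does not say which interval method the authors used, so the Wilson score interval here is an assumption, and all function names are hypothetical. The resulting intervals should land within a fraction of a percentage point of the reported ones.

```python
import math

N_QUESTIONS = 500  # exam length stated in the abstract


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    Assumption: the abstract does not name its CI method; Wilson is used
    here only as a reasonable stand-in. z=1.96 gives roughly 95% coverage.
    """
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


def scale_odds_ratio(or_per_10_words, extra_words):
    """Rescale a per-10-word odds ratio to another word-count increment.

    In a logistic regression the log-odds are linear in the predictor,
    so odds ratios multiply across increments.
    """
    return or_per_10_words ** (extra_words / 10)


# Counts implied by the reported scores.
chatgpt_correct = round(0.734 * N_QUESTIONS)      # 367
chatgpt_missed = N_QUESTIONS - chatgpt_correct    # 133, matching "50/133"
gpt4_correct = chatgpt_correct + 50               # 417, i.e. 83.4% of 500

chatgpt_ci = wilson_interval(chatgpt_correct, N_QUESTIONS)
gpt4_ci = wilson_interval(gpt4_correct, N_QUESTIONS)

# Odds of a correct ChatGPT answer for a question 50 words longer,
# relative to the shorter question, under the reported OR of 0.89 per +10 words.
or_per_50_words = scale_odds_ratio(0.89, 50)

if __name__ == "__main__":
    print(f"ChatGPT: {chatgpt_correct}/{N_QUESTIONS}, "
          f"95% CI {chatgpt_ci[0]:.1%} to {chatgpt_ci[1]:.1%}")
    print(f"GPT-4:   {gpt4_correct}/{N_QUESTIONS}, "
          f"95% CI {gpt4_ci[0]:.1%} to {gpt4_ci[1]:.1%}")
    print(f"Odds ratio per +50 words (ChatGPT): {or_per_50_words:.2f}")
```

The counts are internally consistent: 367 correct answers plus 50 of the 133 questions ChatGPT missed gives 417 of 500, which is the 83.4% reported for GPT-4.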

Publisher

Ovid Technologies (Wolters Kluwer Health)

Subject

Neurology (clinical),Surgery

References: 7 articles.

1. On chatbots and generative artificial intelligence; Oermann; Neurosurgery, 2023

2. How to develop machine learning models for healthcare; Chen; Nat Mater, 2019

3. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models; Kung; PLOS Digit Health, 2023

4. Study behaviors and USMLE step 1 performance: implications of a student self-directed parallel curriculum; Burk-Rafel; Acad Med, 2017

5. A deep learning system for differential diagnosis of skin diseases; Liu; Nat Med, 2020

Cited by 72 articles.

1. ChatGPT performance on the American Shoulder and Elbow Surgeons maintenance of certification exam;Journal of Shoulder and Elbow Surgery;2024-09

2. A Language Model–Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study;JMIR Medical Education;2024-08-16

3. Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study;JMIR Medical Education;2024-08-13

4. Correctness Comparison of ChatGPT‐4, Gemini, Claude‐3, and Copilot for Spatial Tasks;Transactions in GIS;2024-08-12

5. Performance of GPT-4 in Oral and Maxillofacial Surgery Board Exams: Challenges in Specialized Questions;2024-08-10