Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank

Authors:

Rohaid Ali¹, Oliver Y. Tang¹, Ian D. Connolly², Jared S. Fridley¹, John H. Shin², Patricia L. Zadnik Sullivan¹, Deus Cielo¹, Adetokunbo A. Oyelese¹, Curtis E. Doberstein¹, Albert E. Telfeian¹, Ziya L. Gokaslan¹, Wael F. Asaad¹,³,⁴,⁵

Affiliations:

1. Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA;

2. Department of Neurosurgery, Massachusetts General Hospital, Boston, Massachusetts, USA;

3. Norman Prince Neurosciences Institute, Rhode Island Hospital, Providence, Rhode Island, USA;

4. Department of Neuroscience, Brown University, Providence, Rhode Island, USA;

5. Carney Institute for Brain Science, Brown University, Providence, Rhode Island, USA

Abstract

BACKGROUND AND OBJECTIVES: General-purpose large language models (LLMs), such as ChatGPT (GPT-3.5), have demonstrated the capability to pass multiple-choice medical board examinations. However, the comparative accuracy of different LLMs, and their performance on assessments composed predominantly of higher-order management questions, are poorly understood. We aimed to assess the performance of 3 LLMs (GPT-3.5, GPT-4, and Google Bard) on a question bank designed specifically for neurosurgery oral boards examination preparation.

METHODS: The 149-question Self-Assessment Neurosurgery Examination Indications Examination was used to query LLM accuracy. Questions were inputted in a single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests assessed differences in performance by question characteristics.

RESULTS: On a question bank composed predominantly of higher-order questions (85.2%), ChatGPT (GPT-3.5) and GPT-4 answered 62.4% (95% CI: 54.1%-70.1%) and 82.6% (95% CI: 75.2%-88.1%) of questions correctly, respectively. By contrast, Bard scored 44.2% (66/149; 95% CI: 36.2%-52.6%). GPT-3.5 and GPT-4 scored significantly higher than Bard (both P < .01), and GPT-4 outperformed GPT-3.5 (P = .023). Among the 6 subspecialties, GPT-4 had significantly higher accuracy in the Spine category relative to GPT-3.5 and in 4 categories relative to Bard (all P < .01). Incorporation of higher-order problem solving was associated with lower question accuracy for GPT-3.5 (odds ratio [OR] = 0.80, P = .042) and Bard (OR = 0.76, P = .014), but not GPT-4 (OR = 0.86, P = .085). GPT-4's accuracy on imaging-related questions surpassed GPT-3.5's (68.6% vs 47.1%, P = .044) and was comparable with Bard's (68.6% vs 66.7%, P = 1.000). However, GPT-4 demonstrated significantly lower rates of "hallucination" on imaging-related questions than both GPT-3.5 (2.3% vs 57.1%, P < .001) and Bard (2.3% vs 27.3%, P = .002). Questions lacking a text description predicted significantly higher odds of hallucination for GPT-3.5 (OR = 1.45, P = .012) and Bard (OR = 2.09, P < .001).

CONCLUSION: On a question bank of predominantly higher-order management case scenarios for neurosurgery oral boards preparation, GPT-4 achieved a score of 82.6%, outperforming ChatGPT (GPT-3.5) and Google Bard.
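The statistics reported above (accuracy with 95% CIs, pairwise χ² comparisons of model accuracy, and univariable logistic regression odds ratios) are standard binomial and categorical methods. The sketch below is an illustrative Python reconstruction, not the authors' analysis code: the correct-answer counts are taken from the abstract, but the CI method (Wilson interval), the use of unpaired comparisons, and the simulated per-question data for the regression example are assumptions, since the abstract does not specify them.

```python
# Illustrative sketch only (not the authors' code). Assumes Wilson CIs,
# unpaired chi-square comparisons, and simulated per-question data.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportion_confint

N = 149  # questions in the SANS Indications Examination
correct = {"GPT-3.5": 93, "GPT-4": 123, "Bard": 66}  # counts implied by the abstract

# Accuracy with a 95% binomial CI for each model.
for model, k in correct.items():
    lo, hi = proportion_confint(k, N, alpha=0.05, method="wilson")
    print(f"{model}: {k / N:.1%} (95% CI {lo:.1%}-{hi:.1%})")

# Pairwise chi-square test of accuracy, e.g., GPT-4 vs GPT-3.5 (unpaired here;
# a paired test such as McNemar's may better reflect the shared question set,
# so the paper's reported P values may not be reproduced exactly).
table = [[correct["GPT-4"], N - correct["GPT-4"]],
         [correct["GPT-3.5"], N - correct["GPT-3.5"]]]
chi2, p, _, _ = chi2_contingency(table)
print(f"GPT-4 vs GPT-3.5: chi2 = {chi2:.2f}, P = {p:.4f}")

# Univariable logistic regression of per-question correctness on a binary
# question characteristic (placeholder simulated data, not the study's).
rng = np.random.default_rng(0)
is_correct = rng.binomial(1, 0.62, size=N)      # 1 = answered correctly
higher_order = rng.binomial(1, 0.852, size=N)   # 1 = higher-order question
fit = sm.Logit(is_correct, sm.add_constant(higher_order)).fit(disp=0)
print(f"OR = {np.exp(fit.params[1]):.2f}, P = {fit.pvalues[1]:.3f}")
```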

Publisher

Ovid Technologies (Wolters Kluwer Health)

Subject

Neurology (clinical), Surgery

Cited by 120 articles.
