Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases

Authors:

Milad Daniel, Antaki Fares, Milad Jason, Farah Andrew, Khairy Thomas, Mikhail David, Giguère Charles-Édouard, Touma Samir, Bernstein Allison, Szigiato Andrei-Alexandru, Nayman Taylor, Mullie Guillaume A, Duval Renaud

Abstract

Background/aims: This study assesses the proficiency of Generative Pre-trained Transformer (GPT)-4 in answering questions about complex clinical ophthalmology cases.

Methods: We tested GPT-4 on 422 Journal of the American Medical Association (JAMA) Ophthalmology Clinical Challenges, prompting the model to determine the diagnosis (open-ended question) and identify the next step (multiple-choice question). We generated responses using two zero-shot prompting strategies, including zero-shot plan-and-solve+ (PS+), to improve the model's reasoning. We compared the best-performing configuration to human graders in a benchmarking effort.

Results: Using PS+ prompting, GPT-4 achieved mean accuracies of 48.0% (95% CI 43.1% to 52.9%) for diagnosis and 63.0% (95% CI 58.2% to 67.6%) for next step. Next-step accuracy did not differ significantly by subspecialty (p=0.44), but diagnostic accuracy in pathology and tumours was significantly higher than in uveitis (p=0.027). When the diagnosis was accurate, 75.2% (95% CI 68.6% to 80.9%) of the next steps were correct; when the diagnosis was incorrect, 50.2% (95% CI 43.8% to 56.6%) of the next steps were accurate. The next step was three times more likely to be accurate when the initial diagnosis was correct (p<0.001). No significant differences were observed between board-certified ophthalmologists and GPT-4 in diagnostic accuracy or decision-making. Among trainees, senior residents outperformed GPT-4 in diagnostic accuracy (p≤0.001 and p=0.049) and in next-step accuracy (p=0.002 and p=0.020).

Conclusion: Improved prompting enhances GPT-4's performance in complex clinical situations, although it does not surpass ophthalmology trainees in our context. Specialised large language models hold promise for future assistance in medical decision-making and diagnosis.
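To make the prompting approach concrete, the sketch below shows what a zero-shot PS+ query to GPT-4 might look like, along with a quick check of the reported odds ratio. This is a minimal illustration under stated assumptions, not the study's pipeline: the PS_PLUS_TRIGGER wording is paraphrased from the original Plan-and-Solve prompting paper rather than copied from this study, the ask_diagnosis helper and case text are hypothetical, and the code assumes the OpenAI Python SDK with an API key set in the environment.

```python
# Minimal sketch (not the study's exact protocol): zero-shot PS+ prompting of
# GPT-4 via the OpenAI Python SDK. Assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# PS+ trigger phrase, paraphrased from the Plan-and-Solve prompting paper;
# the exact wording used in this study may differ.
PS_PLUS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve it. "
    "Then, let's carry out the plan and solve the problem step by step."
)

def ask_diagnosis(case_text: str) -> str:
    """Open-ended diagnosis question for one clinical case (hypothetical helper)."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep outputs stable across repeated evaluation runs
        messages=[{
            "role": "user",
            "content": f"{case_text}\n\nWhat is the most likely diagnosis?\n\n{PS_PLUS_TRIGGER}",
        }],
    )
    return response.choices[0].message.content

# Rough check of the abstract's "three times more likely" claim, treating the
# reported conditional accuracies (75.2% vs 50.2%) as an approximate odds ratio:
p_given_correct, p_given_incorrect = 0.752, 0.502
odds_ratio = (p_given_correct / (1 - p_given_correct)) / (
    p_given_incorrect / (1 - p_given_incorrect)
)
print(f"approximate odds ratio: {odds_ratio:.1f}")  # ~3.0, consistent with the abstract
```

In the actual study, model responses were graded against the published challenge answers; the snippet only illustrates the shape of a PS+ prompt and the arithmetic behind the reported association.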

Publisher

BMJ

