Influence of a Large Language Model on Diagnostic Reasoning: A Randomized Clinical Vignette Study

Published: 2024-03-14

Author:

Goh Ethan, Gallo Robert, Hom Jason, Strong Eric, Weng Yingjie, Kerman Hannah, Cool Josephine, Kanjee Zahir, Parsons Andrew S., Ahuja Neera, Horvitz Eric, Yang Daniel, Milstein Arnold, Olson Andrew P.J, Rodman Adam, Chen Jonathan H

Abstract

Importance: Diagnostic errors are common and cause significant morbidity. Large language models (LLMs) have shown promise in their performance on both multiple-choice and open-ended medical reasoning examinations, but it remains unknown whether the use of such tools improves diagnostic reasoning.

Objective: To assess the impact of the GPT-4 LLM on physicians’ diagnostic reasoning compared to conventional resources.

Design: Multi-center, randomized clinical vignette study.

Setting: The study was conducted using remote video conferencing with physicians across the country and in-person participation across multiple academic medical institutions.

Participants: Resident and attending physicians with training in family medicine, internal medicine, or emergency medicine.

Intervention(s): Participants were randomized to access GPT-4 in addition to conventional diagnostic resources or to just conventional resources. They were allocated 60 minutes to review up to six clinical vignettes adapted from established diagnostic reasoning exams.

Main Outcome(s) and Measure(s): The primary outcome was diagnostic performance based on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps. Secondary outcomes included time spent per case and final diagnosis.

Results: 50 physicians (26 attendings, 24 residents) participated, with an average of 5.2 cases completed per participant. The median diagnostic reasoning score per case was 76.3 percent (IQR 65.8 to 86.8) for the GPT-4 group and 73.7 percent (IQR 63.2 to 84.2) for the conventional resources group, with an adjusted difference of 1.6 percentage points (95% CI -4.4 to 7.6; p=0.60). The median time spent on cases for the GPT-4 group was 519 seconds (IQR 371 to 668 seconds), compared to 565 seconds (IQR 456 to 788 seconds) for the conventional resources group, with a time difference of -82 seconds (95% CI -195 to 31; p=0.20). GPT-4 alone scored 15.5 percentage points (95% CI 1.5 to 29; p=0.03) higher than the conventional resources group.

Conclusions and Relevance: In a clinical vignette-based study, the availability of GPT-4 to physicians as a diagnostic aid did not significantly improve clinical reasoning compared to conventional resources, although it may improve components of clinical reasoning such as efficiency. GPT-4 alone demonstrated higher performance than both physician groups, suggesting opportunities for further improvement in physician-AI collaboration in clinical practice.

Publisher

Cold Spring Harbor Laboratory

References: 42 articles.

1. Changes in Rates of Autopsy-Detected Diagnostic Errors Over Time

2. Types and Origins of Diagnostic Errors in Primary Care Settings

3. Diagnostic Errors in Hospitalized Adults Who Died or Were Transferred to Intensive Care

4. Improving Diagnosis in Health Care

5. Diagnostic Errors in the Emergency Department: A Systematic Review

Cited by 4 articles.

1. Large Language Model Influence on Management Reasoning: A Randomized Controlled Trial; 2024-08-07

2. Evaluation of the Clinical Utility of DxGPT, a GPT-4 Based Large Language Model, through an Analysis of Diagnostic Accuracy and User Experience; 2024-07-26

3. Effects of interacting with a large language model compared with a human coach on the clinical diagnostic process and outcomes among fourth-year medical students: study protocol for a prospective, randomised experiment using patient vignettes; BMJ Open; 2024-07

4. Natural Language Processing in medicine and ophthalmology: A review for the 21st-century clinician; Asia-Pacific Journal of Ophthalmology; 2024-07