The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model

Published: 2023-02-01

Authors:

Levine David M, Tuwani Rudraksh, Kompa Benjamin, Varma Amita, Finlayson Samuel G., Mehrotra Ateev, Beam Andrew

Abstract

Importance: Artificial intelligence (AI) applications in health care have been effective in many areas of medicine, but they are often trained for a single task using labeled data, making deployment and generalizability challenging. Whether a general-purpose AI language model can perform diagnosis and triage is unknown.

Objective: Compare the general-purpose Generative Pre-trained Transformer 3 (GPT-3) AI model’s diagnostic and triage performance to that of attending physicians and lay adults who use the Internet.

Design: We compared the accuracy of GPT-3’s diagnostic and triage ability for 48 validated case vignettes of both common (e.g., viral illness) and severe (e.g., heart attack) conditions to that of lay people and practicing physicians. We also examined how well calibrated GPT-3’s confidence was for diagnosis and triage.

Setting and Participants: The GPT-3 model, a nationally representative sample of lay people, and practicing physicians.

Exposure: Validated case vignettes (<60 words; <6th grade reading level).

Main Outcomes and Measures: Correct diagnosis, correct triage.

Results: Among all cases, GPT-3 replied with the correct diagnosis in its top 3 for 88% (95% CI, 75% to 94%) of cases, compared to 54% (95% CI, 53% to 55%) for lay individuals (p<0.001) and 96% (95% CI, 94% to 97%) for physicians (p=0.0354). GPT-3 triaged (71% correct; 95% CI, 57% to 82%) similarly to lay individuals (74%; 95% CI, 73% to 75%; p=0.73); both were significantly worse than physicians (91%; 95% CI, 89% to 93%; p<0.001). As measured by the Brier score, GPT-3 confidence in its top prediction was reasonably well-calibrated for diagnosis (Brier score = 0.18) and triage (Brier score = 0.22).

Conclusions and Relevance: A general-purpose AI language model without any content-specific training could perform diagnosis at levels close to, but below, physicians and better than lay individuals. The model performed less well on triage, where its performance was closer to that of lay individuals.
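The two kinds of statistics reported in the abstract (95% confidence intervals for proportions and Brier scores) can be illustrated with a minimal Python sketch. Several things here are assumptions rather than details taken from the abstract: the counts 42/48 and 34/48 are inferred from the rounded percentages 88% and 71%, the Wilson score interval is one common choice of interval method and may not be the one the authors used, and the function names `wilson_interval` and `brier_score` as well as the sample confidences are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


def brier_score(confidences, outcomes):
    """Mean squared difference between stated confidence and the 0/1 outcome.

    0 is perfect; always answering 0.5 scores 0.25.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)


if __name__ == "__main__":
    # Assumed counts: 42 of 48 correct diagnoses, 34 of 48 correct triage calls.
    for label, k in [("diagnosis", 42), ("triage", 34)]:
        lo, hi = wilson_interval(k, 48)
        print(f"{label}: {k / 48:.0%} (95% CI, {lo:.0%} to {hi:.0%})")

    # Made-up confidences for four cases, with 1 = top prediction was correct.
    print(brier_score([0.9, 0.8, 0.6, 0.7], [1, 1, 0, 1]))
```

Under these assumptions the Wilson intervals round to 75% to 94% for diagnosis and 57% to 82% for triage, which agrees with the figures in the abstract, although that agreement does not confirm which method the authors applied.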

Publisher

Cold Spring Harbor Laboratory

References: 43 articles.

1. The potential of artificial intelligence to improve patient safety: a scoping review

2. Artificial Intelligence Based on Machine Learning in Pharmacovigilance: A Scoping Review

3. Artificial Intelligence in Health Care

4. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. National Academy of Medicine. Una reseña

5. Big Data and Machine Learning in Health Care

Cited by 56 articles.

1. Statistical refinement of case vignettes for digital health research; 2024-08-30

2. The AI Future of Emergency Medicine; Annals of Emergency Medicine; 2024-08

3. Evaluation of responses to cardiac imaging questions by the artificial intelligence large language model ChatGPT; Clinical Imaging; 2024-08

4. Using conversant artificial intelligence to improve diagnostic reasoning: ready for prime time?; Medical Journal of Australia; 2024-07-31

5. Emergency Patient Triage Improvement through a Retrieval-Augmented Generation Enhanced Large-Scale Language Model; Prehospital Emergency Care; 2024-07-11