Comparison of the Diagnostic Performance from Patient’s Medical History and Imaging Findings between GPT-4 based ChatGPT and Radiologists in Challenging Neuroradiology Cases

Published: 2023-08-29

Author:

Horiuchi Daisuke, Tatekawa Hiroyuki, Oura Tatsushi, Oue Satoshi, Walston Shannon L, Takita Hirotaka, Matsushita Shu, Mitsuyama Yasuhito, Shimono Taro, Miki Yukio, Ueda Daiju

Abstract

Purpose: To compare the diagnostic performance of Chat Generative Pre-trained Transformer (ChatGPT), based on the GPT-4 architecture, with that of radiologists, using the patient’s medical history and imaging findings in challenging neuroradiology cases.

Methods: We collected 30 consecutive “Freiburg Neuropathology Case Conference” cases published in the journal Clinical Neuroradiology between March 2016 and June 2023. GPT-4 based ChatGPT generated a diagnosis for each case from the provided medical history and imaging findings, and its diagnostic accuracy rate was determined against the published ground truth. Three radiologists with different levels of experience (2, 4, and 7 years, respectively) independently reviewed all cases using the same medical history and imaging findings, and their diagnostic accuracy rates were evaluated. Chi-square tests were performed to compare the diagnostic accuracy rate of ChatGPT with that of each radiologist.

Results: ChatGPT achieved an accuracy rate of 23% (7/30 cases). The radiologists achieved the following accuracy rates: 27% (8/30) for a junior radiology resident, 30% (9/30) for a senior radiology resident, and 47% (14/30) for a board-certified radiologist. ChatGPT’s diagnostic accuracy rate was lower than that of each radiologist, although the differences were not significant (p = 0.99, 0.77, and 0.10, respectively).

Conclusion: The diagnostic performance of GPT-4 based ChatGPT did not reach that of the junior or senior radiology residents or the board-certified radiologist in challenging neuroradiology cases. While ChatGPT holds great promise in the field of neuroradiology, radiologists should be aware of its current performance and limitations for optimal utilization.
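The pairwise comparisons in the Results can be sketched from the counts given in the abstract. The abstract says only that Chi-square tests were used. The Yates continuity correction applied below is an assumption, and the function and variable names are hypothetical, so this is an illustration rather than the authors' analysis code.

```python
import math

def yates_chi_square(correct_a, correct_b, n_cases):
    """Yates-corrected chi-square test (1 degree of freedom) comparing two
    readers who each diagnosed the same number of cases.

    Assumption: the continuity correction is our choice; the abstract does
    not state which chi-square variant was used.
    """
    wrong_a = n_cases - correct_a
    wrong_b = n_cases - correct_b
    total = 2 * n_cases
    col_correct = correct_a + correct_b
    col_wrong = wrong_a + wrong_b
    if col_correct == 0 or col_wrong == 0:
        # Degenerate table: both readers are identical, no evidence of a difference.
        return 0.0, 1.0
    diff = abs(correct_a * wrong_b - wrong_a * correct_b)
    corrected = max(0.0, diff - total / 2)  # Yates continuity correction
    chi2 = total * corrected ** 2 / (n_cases * n_cases * col_correct * col_wrong)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts taken from the abstract: correct diagnoses out of 30 cases.
N_CASES = 30
CHATGPT_CORRECT = 7
RADIOLOGIST_CORRECT = {
    "junior resident": 8,
    "senior resident": 9,
    "board-certified radiologist": 14,
}

for reader, correct in RADIOLOGIST_CORRECT.items():
    chi2, p = yates_chi_square(CHATGPT_CORRECT, correct, N_CASES)
    print(f"ChatGPT vs {reader}: chi2 = {chi2:.3f}, p = {p:.2f}")
```

Under this assumption the p-values come out at approximately 1.00, 0.77, and 0.10, close to the reported 0.99, 0.77, and 0.10. With only 30 cases per reader, even the 7/30 versus 14/30 gap does not reach conventional significance.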

Publisher

Cold Spring Harbor Laboratory

References: 29 articles.

1. Evaluating GPT-4-based ChatGPT’s Clinical Potential on the NEJM Quiz

Cited by 6 articles.

1. Empowering Radiologists with ChatGPT-4o: Comparative Evaluation of Large Language Models and Radiologists in Cardiac Cases; 2024-06-25

2. Optimizing Diagnostic Performance of ChatGPT: The Impact of Prompt Engineering on Thoracic Radiology Cases; Cureus; 2024-05-09

3. Diagnostic Performance Comparison between Generative AI and Physicians: A Systematic Review and Meta-Analysis; 2024-01-22

4. A Comparative Study: Diagnostic Performance of ChatGPT 3.5, Google Bard, Microsoft Bing, and Radiologists in Thoracic Radiology Cases; 2024-01-20

5. The Diagnostic Ability of GPT-3.5 and GPT-4.0 in Surgery: Comparative Analysis (Preprint); 2023-11-29