Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine

Published:2024-07-23 Issue:1 Volume:7
ISSN:2398-6352
Container-title:npj Digital Medicine
language:en
Short-container-title:npj Digit. Med.

Author:

Jin Qiao,Chen Fangyuan,Zhou Yiliang,Xu Ziyang,Cheung Justin M.,Chen Robert,Summers Ronald M.,Rousseau Justin F.,Ni Peiyun,Landsman Marc J.,Baxter Sally L.,Al’Aref Subhi J.,Li Yijia,Chen Alexander,Brejt Josef A.,Chiang Michael F.,Peng Yifan,Lu Zhiyong

Abstract

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V’s rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases where physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominently in image comprehension (27.2%). Regardless of GPT-4V’s high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such multimodal AI models into clinical workflows.

Funder

U.S. Department of Health & Human Services | National Institutes of Health

Publisher

Springer Science and Business Media LLC

Link

https://www.nature.com/articles/s41746-024-01185-7.pdf

Reference: 18 articles.

1. OpenAI. GPT-4 Technical Report. Preprint at arXiv https://doi.org/10.48550/arXiv.2303.08774 (2023).

2. Tian, S. et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief. Bioinforma. 25, bbad493 (2024).

3. Tang, L. et al. Evaluating large language models on medical evidence summarization. NPJ Digit. Med. 6, 158 (2023).

4. Jin, Q., Leaman, R. & Lu, Z. Retrieve, summarize, and verify: how will ChatGPT affect information seeking from the medical literature? J. Am. Soc. Nephrol. 34, 1302-1304 (2023).

5. Jin, Q., Leaman, R. & Lu, Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine 100, 104988 (2024).

Cited by 8 articles.

1. Large language models and dermoscopy: Assessing the potential of task‐specific GPT‐4 vision in diagnosing basal cell carcinoma;Journal of the European Academy of Dermatology and Venereology;2024-09-11

2. GPT-4 Vision: Multi-Modal Evolution of ChatGPT and Potential Role in Radiology;Cureus;2024-08-31

3. Ethical considerations for large language models in ophthalmology;Current Opinion in Ophthalmology;2024-08-27

4. From Revisions to Insights: Converting Radiology Report Revisions into Actionable Educational Feedback Using Generative AI Models;Journal of Imaging Informatics in Medicine;2024-08-19

5. New Approach for Automated Explanation of Material Phenomena (AA6082) Using Artificial Neural Networks and ChatGPT;Applied Sciences;2024-08-09