Validation of a Deep Learning Chest X-ray Interpretation Model: Integrating Large-Scale AI and Large Language Models for Comparative Analysis with ChatGPT-Reference-Cited by-同舟云学术

Validation of a Deep Learning Chest X-ray Interpretation Model: Integrating Large-Scale AI and Large Language Models for Comparative Analysis with ChatGPT

Published:2023-12-30 Issue:1 Volume:14 Page:90
ISSN:2075-4418
Container-title:Diagnostics
language:en
Short-container-title:Diagnostics

Author:

Lee Kyu Hong¹^ORCID,Lee Ro Woon¹^ORCID,Kwon Ye Eun¹

Affiliation:

1. Department of Radiology, College of Medicine, Inha University, Incheon 22212, Republic of Korea

Abstract

This study evaluates the diagnostic accuracy and clinical utility of two artificial intelligence (AI) techniques: Kakao Brain Artificial Neural Network for Chest X-ray Reading (KARA-CXR), an assistive technology developed using large-scale AI and large language models (LLMs), and ChatGPT, a well-known LLM. The study was conducted to validate the performance of the two technologies in chest X-ray reading and explore their potential applications in the medical imaging diagnosis domain. The study methodology consisted of randomly selecting 2000 chest X-ray images from a single institution’s patient database, and two radiologists evaluated the readings provided by KARA-CXR and ChatGPT. The study used five qualitative factors to evaluate the readings generated by each model: accuracy, false findings, location inaccuracies, count inaccuracies, and hallucinations. Statistical analysis showed that KARA-CXR achieved significantly higher diagnostic accuracy compared to ChatGPT. In the ‘Acceptable’ accuracy category, KARA-CXR was rated at 70.50% and 68.00% by two observers, while ChatGPT achieved 40.50% and 47.00%. Interobserver agreement was moderate for both systems, with KARA at 0.74 and GPT4 at 0.73. For ‘False Findings’, KARA-CXR scored 68.00% and 68.50%, while ChatGPT scored 37.00% for both observers, with high interobserver agreements of 0.96 for KARA and 0.97 for GPT4. In ‘Location Inaccuracy’ and ‘Hallucinations’, KARA-CXR outperformed ChatGPT with significant margins. KARA-CXR demonstrated a non-hallucination rate of 75%, which is significantly higher than ChatGPT’s 38%. The interobserver agreement was high for KARA (0.91) and moderate to high for GPT4 (0.85) in the hallucination category. In conclusion, this study demonstrates the potential of AI and large-scale language models in medical imaging and diagnostics. It also shows that in the chest X-ray domain, KARA-CXR has relatively higher accuracy than ChatGPT.

Funder

KakaoBrain

Publisher

MDPI AG

Subject

Clinical Biochemistry

Link

https://www.mdpi.com/2075-4418/14/1/90/pdf

Reference29 articles.

1. Artificial intelligence in healthcare: Complementing, not replacing, doctors and healthcare providers;Sezgin;Digit. Health,2023

2. The imperative for regulatory oversight of large language models (or generative AI) in healthcare;Topol;Npj Digit. Med.,2023

3. Yang, H., Li, J., Liu, S., Du, L., Liu, X., Huang, Y., Shi, Q., and Liu, J. (2023). Exploring the Potential of Large Language Models in Personalized Diabetes Treatment Strategies. medRxiv.

4. Omiye, J.A., Lester, J., Spichak, S., Rotemberg, V., and Daneshjou, R. (2023). Beyond the hype: Large language models propagate race-based medicine. medRxiv.

5. Future of the language models in healthcare: The role of chatbot;Tustumi;Arq. Bras. De Cir. Dig.,2023

Cited by 1 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. ChatGPT’s Accuracy on Magnetic Resonance Imaging Basics: Characteristics and Limitations Depending on the Question Type;Diagnostics;2024-01-12