Abstract
Background
Large language models (LLMs) such as GPT-4 demonstrate promising capabilities in medical image analysis, but their practical utility is hindered by substantial misdiagnosis rates of 30-50%.

Purpose
To improve the diagnostic accuracy of GPT-4 Turbo in neuroradiology cases using prompt engineering strategies, thereby reducing misdiagnosis rates.

Materials and Methods
We used 751 publicly available neuroradiology cases from the American Journal of Neuroradiology Case of the Week Archives. Prompt instructions guided GPT-4 Turbo to analyze clinical and imaging data and to generate a list of five candidate diagnoses with confidence levels. Strategies included role adoption as an imaging expert, step-by-step reasoning, and confidence assessment.

Results
Without any adjustments, the baseline accuracy of GPT-4 Turbo for the top diagnosis was 55.1%, with a misdiagnosis rate of 29.4%; when any of the five candidate diagnoses was accepted, applicability rose to 70.6%. Applying a 90% confidence threshold increased top-diagnosis accuracy to 72.9% and five-candidate applicability to 85.9% while reducing misdiagnoses to 14.1%, but limited the analysis to half of the cases.

Conclusion
Prompt engineering strategies with confidence-level thresholds demonstrated the potential to reduce misdiagnosis rates in neuroradiology cases analyzed by GPT-4 Turbo. This research paves the way toward feasible AI-assisted diagnostic imaging, in which AI suggestions can contribute to human decision-making. However, the study lacks analysis of real-world clinical data, which highlights the need for further investigation across specialties and imaging modalities to optimize thresholds that balance diagnostic accuracy and practical utility.
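The confidence-threshold triage described in the Results can be sketched as follows. This is a minimal illustrative sketch, not the study's actual pipeline: the function name, data layout, and example diagnoses are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of confidence-threshold triage: the model returns
# five candidate diagnoses with confidence levels; cases whose top
# candidate falls below the threshold are deferred to a human reader
# rather than auto-reported. All names and values here are illustrative.

def triage_case(candidates, threshold=0.90):
    """candidates: list of (diagnosis, confidence) pairs, sorted by
    confidence, highest first. Returns ('report', top_diagnosis) if the
    top confidence meets the threshold, otherwise ('defer', None)."""
    top_diagnosis, top_confidence = candidates[0]
    if top_confidence >= threshold:
        return ("report", top_diagnosis)
    return ("defer", None)

# Example: one case above the 90% threshold, one below.
high = [("glioblastoma", 0.95), ("metastasis", 0.03), ("abscess", 0.01),
        ("lymphoma", 0.005), ("demyelination", 0.005)]
low = [("meningioma", 0.60), ("schwannoma", 0.25), ("metastasis", 0.10),
       ("lymphoma", 0.03), ("sarcoidosis", 0.02)]

print(triage_case(high))  # ('report', 'glioblastoma')
print(triage_case(low))   # ('defer', None)
```

Raising the threshold trades coverage for accuracy, which mirrors the reported trade-off: at 90% confidence, accuracy improved but only about half of the cases remained eligible for automated analysis.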
Publisher
Cold Spring Harbor Laboratory