Abstract

Background
Recent studies have explored the application of multimodal large language models (LLMs) in radiological differential diagnosis. Yet, how different multimodal input combinations affect diagnostic performance is not well understood.

Purpose
To evaluate the impact of varying multimodal input elements on the accuracy of GPT-4(V)-based brain MRI differential diagnosis.

Methods
Thirty brain MRI cases with a challenging yet verified diagnosis were selected. Seven prompt groups with variations of four input elements (image, image annotation, medical history, image description) were defined. For each MRI case and prompt group, three identical queries were performed using an LLM-based search engine (© PerplexityAI, powered by GPT-4(V)). Accuracy of the LLM-generated differential diagnoses was rated using a binary and a numeric scoring system and analyzed using a chi-square test and a Kruskal-Wallis test. Results were corrected for the false discovery rate using the Benjamini-Hochberg procedure. Regression analyses were performed to determine the contribution of each individual input element to diagnostic performance.

Results
The prompt group containing an annotated image, medical history, and image description as input exhibited the highest diagnostic accuracy (67.8% correct responses). Significant differences were observed between prompt groups, especially between groups whose inputs included the image description and those that did not. Regression analyses confirmed a large positive effect of the image description on diagnostic accuracy (p ≪ 0.001), as well as a moderate positive effect of the medical history (p < 0.001). The presence of unannotated or annotated images had only minor or insignificant effects on diagnostic accuracy.

Conclusion
The textual description of radiological image findings was identified as the strongest contributor to the performance of GPT-4(V) in brain MRI differential diagnosis, followed by the medical history. The unannotated or annotated image alone yielded very low diagnostic performance. These findings offer guidance on the effective utilization of multimodal LLMs in clinical practice.
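The Methods describe correcting the chi-square and Kruskal-Wallis results for the false discovery rate with the Benjamini-Hochberg procedure. A minimal sketch of that procedure is shown below; the p-values are hypothetical placeholders, since the study's per-comparison p-values are not reported in the abstract.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses rejected at FDR level alpha
    under the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject all hypotheses up to and including rank max_k.
    return sorted(order[:max_k])

# Hypothetical p-values for pairwise prompt-group comparisons.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
print(benjamini_hochberg(pvals))  # → [0, 1]
```

Note that the step-up rule rejects every hypothesis ranked at or below the largest qualifying rank, even if an intermediate sorted p-value exceeds its own threshold; this is what distinguishes Benjamini-Hochberg from a simple per-test cutoff.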
Publisher
Cold Spring Harbor Laboratory
Cited by
2 articles.