Abstract
Background and Aim
Visual data from images is essential for many medical diagnoses. This study evaluates the performance of multimodal Large Language Models (LLMs) in integrating textual and visual information for diagnostic purposes.

Methods
We tested GPT-4o and Claude Sonnet 3.5 on 120 clinical vignettes, with and without accompanying images. Each vignette included patient demographics, a chief complaint, and relevant medical history. Vignettes were paired with either clinical or radiological images from two sources: 100 images from the OPENi database and 20 images from recent NEJM image challenges, ensuring they were not in the LLMs' training sets. Three primary care physicians served as a human benchmark. We analyzed diagnostic accuracy and the models' explanations for a subset of cases.

Results
LLMs outperformed physicians in text-only scenarios (GPT-4o: 70.8%, Claude Sonnet 3.5: 59.5%, physicians: 39.5%). With image integration, all improved, but physicians showed the largest gain (GPT-4o: 84.5%, p<0.001; Claude Sonnet 3.5: 67.3%, p=0.060; physicians: 78.8%, p<0.001). The LLMs changed their explanations in 45-60% of cases when presented with images, demonstrating some degree of visual data integration.

Conclusion
Multimodal LLMs show promise in medical diagnosis, with improved performance when integrating visual evidence. However, this improvement is inconsistent and smaller than that seen in physicians, indicating a need for enhanced visual data processing in these models.
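The abstract does not publish the evaluation harness; the sketch below shows one minimal way the paired text-only vs. text+image conditions could be run against GPT-4o via the OpenAI Python SDK. The prompt wording, vignette handling, and file names are assumptions for illustration, not the authors' actual code.

```python
# Minimal sketch of the paired evaluation described in Methods.
# Assumption: the prompt, field names, and decoding settings are hypothetical;
# the paper does not publish its querying code.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("You are given a clinical vignette. "
          "State the single most likely diagnosis.")

def ask(vignette_text: str, image_path: str | None = None) -> str:
    """Query GPT-4o with the vignette, optionally attaching the paired image."""
    content = [{"type": "text", "text": f"{PROMPT}\n\n{vignette_text}"}]
    if image_path is not None:
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        temperature=0,  # assumption: deterministic decoding for comparability
    )
    return resp.choices[0].message.content

# Each vignette is scored under both conditions:
# text_answer = ask(vignette)                        # text-only
# multimodal_answer = ask(vignette, "case_001.jpg")  # text + image
```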
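The Results report p-values for the change in accuracy when images are added, but the abstract does not name the statistical test. For paired correct/incorrect outcomes on the same 120 vignettes, McNemar's test is one standard choice; the sketch below uses statsmodels with fabricated, clearly labeled placeholder data purely to show the mechanics.

```python
# Illustrative McNemar's test for paired accuracy with vs. without images.
# Assumption: the abstract does not name its test; McNemar's test is a common
# choice for paired binary outcomes on the same cases. Data here is fake.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
# 1 = correct diagnosis, 0 = incorrect, for the same 120 vignettes (placeholder).
text_only = rng.integers(0, 2, size=120)
with_image = rng.integers(0, 2, size=120)

# 2x2 table of paired outcomes: rows = text-only, columns = with image.
table = np.array([
    [np.sum((text_only == 1) & (with_image == 1)),
     np.sum((text_only == 1) & (with_image == 0))],
    [np.sum((text_only == 0) & (with_image == 1)),
     np.sum((text_only == 0) & (with_image == 0))],
])

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"McNemar p-value: {result.pvalue:.3f}")
```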
Publisher
Cold Spring Harbor Laboratory