BACKGROUND
Several multimodal generative artificial intelligence (AI) systems, including ChatGPT-4 with vision (also known as ChatGPT-4V or ChatGPT-4Vision), accept image data together with text data. However, how adding image data changes the diagnostic accuracy of ChatGPT-4 is unknown.
OBJECTIVE
We compared the diagnostic accuracy of ChatGPT-4 with vision, given text and images as input (intervention), with that of ChatGPT-4 without vision, given text only (control), on case descriptions derived from case reports.
METHODS
We used a dataset of case descriptions and final diagnoses derived from case reports published in the American Journal of Case Reports from January 2022 to March 2023. We also extracted the figures and tables mentioned in the case descriptions as image data. We excluded nondiagnostic case reports, pediatric case reports, and case reports without figures or tables in their case descriptions. From the case descriptions and images, ChatGPT-4 with vision generated differential-diagnosis lists. We compared its diagnostic accuracy with that of ChatGPT-4 without vision, which received the same case descriptions without images. Two physicians independently evaluated whether the final diagnosis was included in each list. Discrepancies were resolved by a third physician.
RESULTS
A total of 363 case descriptions were included. The rate of final diagnoses within the top 10 differential-diagnosis lists generated by ChatGPT-4 with vision was 85.1% (309/363), which did not differ significantly from the 87.9% (319/363) achieved by ChatGPT-4 without vision (P=.33). The rate of final diagnoses as the top diagnosis generated by ChatGPT-4 with vision was 44.4% (161/363), significantly lower than the 55.9% (203/363) achieved by ChatGPT-4 without vision (P=.002).
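The abstract does not state which statistical test produced these P values. As a rough check under one assumption, the sketch below applies a chi-square test with Yates continuity correction to the reported counts; the function name `yates_chi2_p` and the choice of test are illustrative assumptions, not the authors' stated method. Under that assumption, the counts give P values consistent with the reported .33 and .002.

```python
import math

def yates_chi2_p(a, b, c, d):
    """Two-sided P value for the 2x2 table [[a, b], [c, d]] using a
    chi-square test with Yates continuity correction (1 degree of freedom).
    Assumption: this test is chosen for illustration only."""
    n = a + b + c + d
    # Yates correction subtracts n/2 from |ad - bc| (floored at zero).
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    chi2 = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 df:
    # P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

N = 363  # case descriptions included

# Counts reported in the abstract: final diagnosis found / total
with_vision_top10, without_vision_top10 = 309, 319
with_vision_top1, without_vision_top1 = 161, 203

top10_p = yates_chi2_p(with_vision_top10, N - with_vision_top10,
                       without_vision_top10, N - without_vision_top10)
top1_p = yates_chi2_p(with_vision_top1, N - with_vision_top1,
                      without_vision_top1, N - without_vision_top1)

print(f"Top 10: {with_vision_top10 / N:.1%} vs {without_vision_top10 / N:.1%}, P={top10_p:.2f}")
print(f"Top 1:  {with_vision_top1 / N:.1%} vs {without_vision_top1 / N:.1%}, P={top1_p:.3f}")
```

An unpaired test is used here only because the abstract reports marginal counts rather than per-case results for the two conditions.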
CONCLUSIONS
The rate of final diagnoses within the top 10 differential-diagnosis lists generated by ChatGPT-4 with vision was not higher than that generated without vision. The rate of final diagnoses as the top diagnosis generated by ChatGPT-4 with vision was lower than that generated without vision. These results suggest that ChatGPT-4 with vision, a multimodal generative AI system, relies mainly on text data when generating differential diagnoses, even though it accepts image data. Multimodal generative AI systems should be further developed to improve diagnostic performance through better integration of clinical data before being used in medicine.
CLINICALTRIAL
Not applicable