BACKGROUND
Large language models (LLMs) have significantly transformed the field of natural language processing, and cutting-edge models such as ChatGPT are now at the forefront of medical AI.
OBJECTIVE
This study aimed to assess the performance of 5 distinct LLMs (GPT-3.5, GPT-4, PaLM 2, Claude 2, and SenseNova) against two human cohorts (a group of funduscopic disease experts and a group of general ophthalmologists) on the specialized subject of funduscopic disease.
METHODS
The 5 LLMs and the 2 human groups independently completed a 100-item funduscopic disease test. Performance was assessed by comparing average scores, response stability across repeated administrations, and answer confidence.
RESULTS
Among the LLMs, GPT-4 and PaLM 2 showed the highest average correlation. GPT-4 also achieved the highest average score and expressed the greatest confidence in its answers. Compared with the human cohorts, GPT-4 performed on par with the general ophthalmologists but fell short of the funduscopic disease experts.
CONCLUSIONS
This study provides evidence of the strong performance of GPT-4 in the domain of funduscopic disease. With continued improvement and validation, LLMs have the potential to deliver meaningful benefits in health care for both patients and physicians.