Abstract
Objective: To evaluate ChatGPT’s ability to identify more than mild diabetic retinopathy (mtmDR) and vision-threatening diabetic retinopathy (VTDR) from single fundus images.
Methods: Sixty images were randomly selected in equal proportions across six categories: normal, mild nonproliferative DR (NPDR), moderate NPDR, severe NPDR or proliferative DR (PDR), blur fundus without PDR, and blur fundus with PDR, from a license-free, publicly available database. Each image was submitted to ChatGPT three times with a standardized prompt asking whether it showed mtmDR or VTDR, and the response was recorded. The images were also presented in randomized order to a panel of retina specialists, who graded each image as readable or unreadable and, if readable, as showing mtmDR and/or VTDR. The retina specialists' majority response was considered the gold standard.
Results: ChatGPT was able to read 132/180 (73.3%) of the image prompts, while the retina specialists read 158/180 (87.8%), with excellent interrater reliability. For mtmDR, compared with the retina specialist consensus, ChatGPT demonstrated a sensitivity of 96.2%, specificity of 19.1%, positive predictive value (PPV) of 69.1%, and negative predictive value (NPV) of 72.7%; of the prompts it read, ChatGPT labeled 110/121 (90.9%) as mtmDR. For VTDR, ChatGPT demonstrated a sensitivity of 63.0%, specificity of 62.5%, PPV of 71.9%, and NPV of 52.6%. ChatGPT labeled 69/121 (57.0%) of images as VTDR and mislabeled 27/90 (30.0%) of non-VTDR images as VTDR.
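For reference, the metrics above follow the standard 2x2 contingency definitions against the specialist gold standard (TP, FP, TN, FN denote true/false positives and negatives; the individual cell counts are not restated in this abstract): Sensitivity = TP/(TP+FN), Specificity = TN/(TN+FP), PPV = TP/(TP+FP), NPV = TN/(TN+FN).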
Conclusion: ChatGPT demonstrated modest sensitivity and specificity in differentiating mtmDR and VTDR compared with retina specialists.