BACKGROUND
The global incidence of blindness has continued to increase despite the enactment of a Global Eye Health Action Plan by the World Health Assembly. This can be attributed in part to an aging population, but also to the limited diagnostic resources within low- and middle-income countries (LMICs). The advent of Artificial Intelligence (AI) within healthcare could offer a novel approach to reducing the prevalence of blindness globally.
OBJECTIVE
The study aimed to establish whether a complex prompt altered the accuracy with which GPT-4 diagnosed common ophthalmological conditions, and to quantify any differences in performance.
METHODS
Two AI models (gpt-4-0125-preview and an altered version of the Alan super prompt running on gpt-4-0125-preview) were instructed to diagnose the condition present in 12 clinical vignettes. The vignettes comprised five prevalent adult conditions, five prevalent childhood conditions and two control cases, one adult-orientated and one child-orientated. Through prompt engineering, the AI models were “forced” to provide only the name of the diagnosis. Each vignette was presented to each model 100 times.
The data then underwent statistical analysis. A Chi-Square Test of Independence compared the total true positives across all the conditions between the two models. Additionally, statistical screening metrics (sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV)) were used to determine the accuracy of each model.
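As an illustration of the analysis described above, the following is a minimal sketch rather than the study's actual code. It assumes each model's responses have already been reduced to confusion-matrix counts, and that the comparison between the two models is a 2x2 table of correct versus incorrect diagnoses with no continuity correction. The function names `screening_metrics` and `chi_square_2x2` are hypothetical, and the counts in the usage lines are made up for illustration only.

```python
import math


def screening_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    def ratio(num, den):
        # Return NaN when a denominator is zero, so undefined metrics are visible.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positives among all diseased cases
        "specificity": ratio(tn, tn + fp),  # true negatives among all non-diseased cases
        "ppv": ratio(tp, tp + fp),          # true positives among all positive calls
        "npv": ratio(tn, tn + fn),          # true negatives among all negative calls
    }


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Assumption: no Yates continuity correction, and every row and column
    total is non-zero. Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Illustrative counts only; these are NOT the study's data.
metrics = screening_metrics(tp=80, fp=10, tn=90, fn=20)
stat, p_value = chi_square_2x2(10, 20, 20, 10)  # rows: models, columns: correct / incorrect
```

In this layout, each row of the 2x2 table would hold one model's totals and each column an outcome (correct or incorrect diagnosis), so a small p-value suggests that the rate of correct diagnoses depends on which model was used.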
RESULTS
There was a significant difference between the AI models in the total number of true positives for the conditions investigated (χ² = 428.86, P = 9.446e-87). The altered Alan super prompt produced more correct diagnoses than gpt-4-0125-preview for all conditions except retinopathy of prematurity (ROP).
CONCLUSIONS
Overall, the study established that the inclusion of a complex prompt improved the diagnostic accuracy of gpt-4-0125-preview. The greatest difference in the performance of the models was observed in conditions more prevalent in LMICs. The results highlighted the potential impact that Alan could have on healthcare systems within LMICs by augmenting the medical diagnostic process. Although the altered Alan super prompt requires further refinement, the implementation of AI applications in healthcare systems within LMICs could improve patient outcomes in these regions.