Abstract
Background: This study presents a comprehensive evaluation of the performance of various large language models (LLMs) in generating responses to ophthalmology emergencies and compares their accuracy with the established NHS 111 online triage system.

Methods: We included 21 ophthalmology-related emergency scenario questions from the NHS 111 triage algorithm. These questions covered four different ophthalmology emergency themes as laid out in the NHS 111 algorithm. The responses generated by NHS 111 online were compared with the responses of the different LLM chatbots. We included a range of models: ChatGPT-3.5, Google Bard, Bing Chat, and ChatGPT-4.0. The accuracy of each LLM chatbot response was assessed against the NHS 111 triage output using a two-prompt strategy. Answers were graded separately by two authors as follows: −2 "Very poor", −1 "Poor", 0 "No response", 1 "Good", 2 "Very good", and 3 "Excellent".

Results: An overall score of ≥ 1 ("Good" or better) was achieved by 93% of responses across all LLMs. This indicates that at least part of the answer contained correct information and partially matched the NHS 111 response, with no incorrect information or advice potentially harmful to the patient's health.

Conclusions: The high accuracy and safety observed in LLM responses support their potential as effective tools for providing timely information and guidance to patients. While further research is warranted to validate these findings in clinical practice, LLMs hold promise for enhancing patient care and healthcare accessibility in the digital age.