Evaluating the effectiveness of large language models in patient education for conjunctivitis

Published: 2024-08-30
Page: bjo-2024-325599
ISSN: 0007-1161
Container-title: British Journal of Ophthalmology
Language: en
Short-container-title: Br J Ophthalmol

Author:

Wang Jingyuan, Shi Runhan, Le Qihua, Shan Kun, Chen Zhi, Zhou Xujiao, He Yao, Hong Jiaxu

Abstract

Aims: To evaluate the quality of responses from large language models (LLMs) to patient-generated conjunctivitis questions.

Methods: A two-phase, cross-sectional study was conducted at the Eye and ENT Hospital of Fudan University. In phase 1, four LLMs (GPT-4, Qwen, Baichuan 2 and PaLM 2) responded to 22 frequently asked conjunctivitis questions. Six expert ophthalmologists assessed these responses using a 5-point Likert scale for correctness, completeness, readability, helpfulness and safety, supplemented by objective readability analysis. Phase 2 involved 30 conjunctivitis patients who interacted with GPT-4 or Qwen, evaluating the LLM-generated responses based on satisfaction, humanisation, professionalism and the same dimensions except for correctness from phase 1. Three ophthalmologists assessed responses using phase 1 criteria, allowing for a comparative analysis between medical and patient evaluations, probing the study’s practical significance.

Results: In phase 1, GPT-4 excelled across all metrics, particularly in correctness (4.39±0.76), completeness (4.31±0.96) and readability (4.65±0.59), while Qwen showed similarly strong performance in helpfulness (4.37±0.93) and safety (4.25±1.03). Baichuan 2 and PaLM 2 were effective but trailed behind GPT-4 and Qwen. The objective readability analysis revealed GPT-4’s responses as the most detailed, with PaLM 2’s being the most succinct. Phase 2 demonstrated GPT-4 and Qwen’s robust performance, with high satisfaction levels and consistent evaluations from both patients and professionals.

Conclusions: Our study showed LLMs effectively improve patient education in conjunctivitis. These models showed considerable promise in real-world patient interactions. Despite encouraging results, further refinement, particularly in personalisation and handling complex inquiries, is essential prior to the clinical integration of these LLMs.

Funder

National Natural Science Foundation of China

Research and Development Program of China

Shanghai Medical Innovation Research Program

Shanghai Key Clinical Research Program

Publisher

BMJ

References (29 articles)

1. Evaluation of Large language model performance on the Multi-Specialty Recruitment Assessment (MSRA) exam; Tsoutsanis; Comput Biol Med, 2024

2. Large language models encode clinical knowledge; Singhal; Nature, 2023

3. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine

4. Accuracy and Reliability of Chatbot Responses to Physician Questions; Goodman; JAMA Netw Open, 2023

5. Popular large language model chatbots’ accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries; Pushpanathan; iScience, 2023