Abstract
To compare the performance of humans, GPT-4.0 and GPT-3.5 in answering multiple-choice questions from the American Academy of Ophthalmology (AAO) Basic and Clinical Science Course (BCSC) self-assessment program, available at https://www.aao.org/education/self-assessments. In June 2023, text-based multiple-choice questions were submitted to GPT-4.0 and GPT-3.5. The AAO provides the percentage of humans who selected the correct answer, which was analyzed for comparison. All questions were classified into 10 subspecialties and 3 practice areas (diagnostics/clinics, medical treatment, surgery). Out of 1023 questions, GPT-4.0 achieved the best score (82.4%), followed by humans (75.7%) and GPT-3.5 (65.9%), with significant differences in accuracy rates between all pairs (all P < 0.0001). Both GPT-4.0 and GPT-3.5 showed the worst results in surgery-related questions (74.6% and 57.0%, respectively). For difficult questions (answered incorrectly by > 50% of humans), both GPT models compared favorably to humans, although the difference did not reach statistical significance. The word count of answers provided by GPT-4.0 was significantly lower than that of answers produced by GPT-3.5 (160 ± 56 vs. 206 ± 77 words, respectively; P < 0.0001); however, incorrect responses were longer (P < 0.02). GPT-4.0 represented a substantial improvement over GPT-3.5, achieving better performance than humans in an AAO BCSC self-assessment test. However, ChatGPT is still limited by inconsistency across different practice areas, especially surgery.
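The abstract does not state which statistical test produced the reported P values; as a minimal illustrative sketch (not the authors' analysis code), pairwise accuracy rates like those above can be compared with a chi-square test on 2×2 contingency tables. The counts below are reconstructed from the reported percentages over the 1023 questions and are assumptions for illustration only.

```python
# Illustrative sketch: pairwise chi-square tests on accuracy rates
# reconstructed from the abstract's percentages (assumed counts, not source data).
from itertools import combinations
from scipy.stats import chi2_contingency

n_questions = 1023
correct = {
    "GPT-4.0": round(0.824 * n_questions),  # ~843 correct
    "Humans": round(0.757 * n_questions),   # ~774 correct
    "GPT-3.5": round(0.659 * n_questions),  # ~674 correct
}

for a, b in combinations(correct, 2):
    # 2x2 table: rows = responder, columns = correct / incorrect
    table = [
        [correct[a], n_questions - correct[a]],
        [correct[b], n_questions - correct[b]],
    ]
    chi2, p, _, _ = chi2_contingency(table)
    print(f"{a} vs {b}: chi2 = {chi2:.1f}, p = {p:.2e}")
```

Note that the human "accuracy" is an average of respondent percentages per question rather than a single pass/fail count, so this sketch only approximates that comparison.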
Publisher
Springer Science and Business Media LLC