Large language models in healthcare: from a systematic review on medical examinations to a comparative analysis on fundamentals of robotic surgery online test-Reference-Cited by-同舟云学术

Large language models in healthcare: from a systematic review on medical examinations to a comparative analysis on fundamentals of robotic surgery online test

Published:2024-08-06 Issue:9 Volume:57 Page:
ISSN:1573-7462
Container-title:Artificial Intelligence Review
language:en
Short-container-title:Artif Intell Rev

Author:

Moglia Andrea,Georgiou Konstantinos,Cerveri Pietro,Mainardi Luca,Satava Richard M.,Cuschieri Alfred

Abstract

AbstractLarge language models (LLMs) have the intrinsic potential to acquire medical knowledge. Several studies assessing LLMs on medical examinations have been published. However, there is no reported evidence on tests related to robot-assisted surgery. The aims of this study were to perform the first systematic review of LLMs on medical examinations and to establish whether ChatGPT, GPT-4, and Bard can pass the Fundamentals of Robotic Surgery (FRS) didactic test. A literature search was performed on PubMed, Web of Science, Scopus, and arXiv following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) approach. A total of 45 studies were analyzed. GPT-4 passed several national qualifying examinations with questions in English, Chinese, and Japanese using zero-shot and few-shot learning. Med-PaLM 2 obtained similar scores on the United States Medical Licensing Examination with more refined prompt engineering techniques. Five different 2023 releases of ChatGPT, one of GPT-4, and one of Bard were tested on FRS. Seven attempts were performed with each release. The pass score was 79.5%. ChatGPT achieved a mean score of 64.6%, 65.6%, 75.0%, 78.9%, and 72.7% respectively from the first to the fifth tested release on FRS vs 91.5% of GPT-4 and 79.5% of Bard. GPT-4 outperformed ChatGPT and Bard in all corresponding attempts with a statistically significant difference for ChatGPT (p < 0.001), but not Bard (p = 0.002). Our findings agree with other studies included in this systematic review. We highlighted the potential and challenges of LLMs to transform the education of healthcare professionals in the different stages of learning, by assisting teachers in the preparation of teaching contents, and trainees in the acquisition of knowledge, up to becoming an assessment framework of leaners.

Funder

Politecnico di Milano

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s10462-024-10849-5.pdf

Reference137 articles.

1. Abd-Alrazaq A, AlSaad R, Alhuwail D, Ahmed A, Heal PM, Latifi S, Aziz S, Damseh R, Alabed Alrazak S, Sheikh J (2023) Large language models in medical education: opportunities, challenges, and future directions. JMIR Med Educ 9:e48291

2. Abi-Rafeh J et al (2023) Complications following facelift and neck lift: implementation and assessment of large language model and artificial intelligence (ChatGPT) performance across 16 simulated patient presentations. Aesthetic Plast Surg 47(6):2407–2414. https://doi.org/10.1007/s00266-023-03538-1

3. Agarwal M et al (2023) Analysing the applicability of ChatGPT, bard, and Bing to generate reasoning-based multiple-choice questions in medical physiology. Cureus 15(6):e40977–e40977. https://doi.org/10.7759/cureus.40977

4. Alanzi TM (2023) Impact of ChatGPT on teleconsultants in healthcare: perceptions of healthcare experts in Saudi Arabia. J Multidiscip Healthc 16:2309–2321. https://doi.org/10.2147/JMDH.S419847

5. Alayrac, J, Donahue J, Luc P (2022) Flamingo: a visual language model for fewshot learning. arXiv:2204.14198. https://doi.org/10.48550/arXiv.2204.14198