A Systematic Review and Meta-Analysis of the Accuracy and Capability of Artificial Intelligence Solutions in Healthcare Exams and Certificates (Preprint)

Published: 2024-01-18

Author:

Waldock William Joel, Zhang Joe, Guni Ahmad, Nabeel Ahmad, Darzi Ara, Ashrafian Hutan

Abstract

BACKGROUND

Large Language Models (LLMs) have dominated public interest due to their apparent capability to accurately replicate learned knowledge in narrative text.

OBJECTIVE

In response to this rapidly progressing field, we aimed to establish a baseline performance and quality standard for the current generation of LLMs in narrative medical response tasks.

METHODS

We quantified the accuracy of LLMs in responding to healthcare examination questions and evaluated the consistency and quality of study reporting. The protocol was registered with OSF (https://osf.io/xqzkw). The search covered all papers up to 09/10/2023, the date on which a preliminary search was conducted and piloting of the study selection process began, using MEDLINE, Embase, Global Health, the Cochrane Library and the Health Technology Assessment Database, alongside the OVID search interface. The literature search used the following MeSH terms in all possible combinations: ‘artificial intelligence’, ‘ChatGPT’, ‘GPT’, ‘LLM’, ‘Large Language Model’, ‘machine learning’, ‘neural network’, ‘Generative Pre-trained Transformer’, ‘Generative Transformer’, ‘Generative Language Model’, ‘Generative Model’, ‘medical exam’, ‘healthcare exam’, ‘clinical exam’. Sensitivity, accuracy and precision data were extracted, together with the relevant confidence intervals.
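
One way to read "all possible combinations" is as a Boolean query that ORs the terms within each concept group and ANDs the groups together. The sketch below illustrates that reading only. The split into an AI group and an exam group, the helper names `or_group` and `build_query`, and the quoting style are assumptions, not the authors' documented search strategy or OVID syntax.

```python
# Illustrative sketch: combine the listed search terms into one Boolean query.
# The two-group split and the helper names are assumptions for illustration.

ai_terms = [
    "artificial intelligence", "ChatGPT", "GPT", "LLM",
    "Large Language Model", "machine learning", "neural network",
    "Generative Pre-trained Transformer", "Generative Transformer",
    "Generative Language Model", "Generative Model",
]
exam_terms = ["medical exam", "healthcare exam", "clinical exam"]


def or_group(terms):
    """Quote each term and join the group with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(*groups):
    """AND together the OR-groups, so every AI term pairs with every exam term."""
    return " AND ".join(or_group(g) for g in groups)


query = build_query(ai_terms, exam_terms)
print(query)
```

A query of this shape matches any record that contains at least one term from each group, which covers every pairwise combination of an AI term with an exam term.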

RESULTS

The search identified 1673 relevant citations. After removing duplicate results, 1268 articles were screened by title and abstract, and 32 studies were included for full-text review. Our meta-analysis suggests that LLMs achieve an overall accuracy of 0.61 (CI 0.58, 0.64) on medical exams and an accuracy of 0.51 (CI 0.46, 0.56) on the USMLE, and that ChatGPT achieves an overall accuracy of 0.64 (CI 0.6, 0.67) on medical exams.
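
Pooled accuracies with confidence intervals of this kind are commonly obtained by combining per-study proportions in a random-effects model. The sketch below shows one such approach, a DerSimonian-Laird estimate on logit-transformed proportions. The abstract does not state which model the authors used, so the choice of method, the function name `pooled_accuracy`, and the per-study counts are all assumptions made for illustration; the counts are invented and are not data from the review.

```python
import math


def pooled_accuracy(studies, z=1.96):
    """Random-effects (DerSimonian-Laird) pooled proportion on the logit scale.

    studies: list of (correct, total) pairs, one per study.
    Returns (pooled, lower, upper) on the proportion scale.
    """
    effects, variances = [], []
    for correct, total in studies:
        # A 0.5 continuity correction keeps the logit finite at 0% or 100%.
        hits = correct + 0.5
        misses = (total - correct) + 0.5
        effects.append(math.log(hits / misses))
        variances.append(1 / hits + 1 / misses)

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # Between-study variance tau^2, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) if c > 0 else 0.0

    # Random-effects weights and pooled logit with its standard error.
    w_re = [1 / (v + tau2) for v in variances]
    mu = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    def inv_logit(x):
        return 1 / (1 + math.exp(-x))

    return inv_logit(mu), inv_logit(mu - z * se), inv_logit(mu + z * se)


# Hypothetical counts (correct answers, questions asked); NOT from the review.
example_studies = [(60, 100), (130, 200), (45, 90)]
pooled, lower, upper = pooled_accuracy(example_studies)
print(f"pooled accuracy {pooled:.2f} (CI {lower:.2f}, {upper:.2f})")
```

Working on the logit scale keeps the back-transformed interval inside (0, 1). The between-study variance term widens the interval when studies disagree more than sampling error alone would explain.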

CONCLUSIONS

To guide policy and deployment decisions about Large Language Models in healthcare, we propose a new framework called RUBRICC: Regulatory, Usability, Bias, Reliability (Evidence & Safety), Interoperability, Cost, and Co-design-PPIE. This presents a valuable opportunity to direct the clinical commissioning of new LLM capabilities into health services whilst respecting patient safety considerations.

CLINICALTRIAL

OSF (https://osf.io/xqzkw)

Publisher

JMIR Publications Inc.
