BACKGROUND
Large Language Models (LLMs) have dominated public interest due to their apparent capability to accurately replicate learned knowledge in narrative text.
OBJECTIVE
In response to this rapidly progressing field, we aimed to establish a baseline performance and quality standard for the current generation of LLMs in narrative medical response tasks.
METHODS
We quantified the accuracy of LLMs in responding to healthcare examination questions and evaluated the consistency and quality of study reporting. The protocol was registered with OSF (https://osf.io/xqzkw). The search included all papers published up until 09/10/2023, at which point a preliminary search was conducted and piloting of the study selection process commenced using MEDLINE, Embase, Global Health, the Cochrane Library, and the Health Technology Assessment Database via the OVID search interface. The literature search used the following MeSH terms in all possible combinations: ‘artificial intelligence’, ‘ChatGPT’, ‘GPT’, ‘LLM’, ‘Large Language Model’, ‘machine learning’, ‘neural network’, ‘Generative Pre-trained Transformer’, ‘Generative Transformer’, ‘Generative Language Model’, ‘Generative Model’, ‘medical exam’, ‘healthcare exam’, ‘clinical exam’. Sensitivity, accuracy, and precision data were extracted, including the relevant confidence intervals.
RESULTS
The search identified 1673 relevant citations. After removal of duplicates, 1268 articles were screened by title and abstract, and 32 studies were included for full-text review. Our meta-analysis found an overall accuracy for LLMs on medical exams of 0.61 (CI 0.58, 0.64), an accuracy for LLMs on the USMLE of 0.51 (CI 0.46, 0.56), and an overall accuracy for ChatGPT on medical exams of 0.64 (CI 0.60, 0.67).
CONCLUSIONS
To inform policy and deployment decisions about using Large Language Models to advance healthcare, we propose a new framework called RUBRICC: Regulatory, Usability, Bias, Reliability (Evidence & Safety), Interoperability, Cost, and Co-design-PPIE. This framework presents a valuable opportunity to direct the clinical commissioning of new LLM capabilities into health services whilst respecting patient safety considerations.
CLINICALTRIAL
OSF (https://osf.io/xqzkw)