Affiliations:
1. Rhazes AI, London, UK
2. National Health Service England, London, UK
Abstract
Background
Given the strikingly high diagnostic error rate in hospitals, and the recent development of Large Language Models (LLMs), we set out to measure the diagnostic sensitivity of two popular LLMs: GPT-4 and PaLM2. Small-scale studies evaluating the diagnostic ability of LLMs have shown promising results, with GPT-4 demonstrating high accuracy in diagnosing test cases. However, larger evaluations on real electronic patient data are needed to provide more reliable estimates.

Methods
To fill this gap in the literature, we used a deidentified Electronic Health Record (EHR) data set of about 300,000 patients admitted to the Beth Israel Deaconess Medical Center in Boston. This data set contained blood, imaging, microbiology and vital sign information, as well as the patients' medical diagnostic codes. Based on the available EHR data, doctors curated a set of diagnoses for each patient, which we refer to as the ground truth diagnoses. We then designed carefully written prompts to elicit diagnostic predictions from the LLMs and compared these to the ground truth diagnoses in a random sample of 1,000 patients.

Results
Based on the proportion of correctly predicted ground truth diagnoses, we estimated the diagnostic hit rate of GPT-4 to be 93.9%. PaLM2 achieved 84.7% on the same data set. On these 1,000 randomly selected EHRs, GPT-4 correctly identified 1,116 unique diagnoses.

Conclusion
The results suggest that, working alongside clinicians, artificial intelligence (AI) has the potential to reduce the cognitive errors that lead to hundreds of thousands of misdiagnoses every year. However, human oversight of AI remains essential: LLMs cannot replace clinicians, especially when it comes to human understanding and empathy. Furthermore, significant challenges to incorporating AI into health care remain, including ethical, liability and regulatory barriers.
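To make the headline metric concrete, below is a minimal sketch of how a hit rate of this kind could be computed; the function name, data structures and toy diagnoses are hypothetical illustrations, not the study's actual evaluation pipeline or matching criteria.

```python
# Hypothetical sketch: hit rate as the proportion of ground truth diagnoses
# that also appear among the model's predictions, pooled over all patients.
def hit_rate(ground_truth: list[set[str]], predictions: list[set[str]]) -> float:
    """ground_truth[i] and predictions[i] are the diagnosis sets for patient i."""
    total = sum(len(gt) for gt in ground_truth)
    hits = sum(len(gt & pred) for gt, pred in zip(ground_truth, predictions))
    return hits / total if total else 0.0

# Toy example: 3 of 4 ground truth diagnoses are matched, giving a hit rate of 0.75.
gt = [{"sepsis", "pneumonia"}, {"anaemia", "heart failure"}]
pred = [{"sepsis", "pneumonia", "uti"}, {"anaemia"}]
print(hit_rate(gt, pred))  # 0.75
```

In practice, matching free-text LLM output to curated diagnoses requires normalisation or clinician adjudication rather than exact string equality; the set intersection here merely stands in for whichever matching rule the evaluators applied.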