A systematic evaluation of the performance of GPT‐4 and PaLM2 to diagnose comorbidities in MIMIC‐IV patients

Author:

Sarvari Peter1, Al‐fagih Zaid1, Ghuwel Abdullatif2, Al‐fagih Othman2

Affiliation:

1. Rhazes AI London UK

2. National Health Service England London UK

Abstract

Background: Given the strikingly high diagnostic error rate in hospitals and the recent development of Large Language Models (LLMs), we set out to measure the diagnostic sensitivity of two popular LLMs: GPT‐4 and PaLM2. Small‐scale studies evaluating the diagnostic ability of LLMs have shown promising results, with GPT‐4 demonstrating high accuracy in diagnosing test cases. However, larger evaluations on real electronic patient data are needed to provide more reliable estimates.

Methods: To fill this gap in the literature, we used a deidentified Electronic Health Record (EHR) data set of about 300,000 patients admitted to the Beth Israel Deaconess Medical Center in Boston. This data set contained blood, imaging, microbiology and vital sign information as well as the patients' medical diagnostic codes. Based on the available EHR data, doctors curated a set of diagnoses for each patient, which we refer to as ground truth diagnoses. We then designed carefully written prompts to obtain diagnostic predictions for each patient from the LLMs and compared these predictions with the ground truth diagnoses in a random sample of 1000 patients.

Results: Based on the proportion of correctly predicted ground truth diagnoses, we estimated the diagnostic hit rate of GPT‐4 to be 93.9%. PaLM2 achieved 84.7% on the same data set. On these 1000 randomly selected EHRs, GPT‐4 correctly identified 1116 unique diagnoses.

Conclusion: The results suggest that artificial intelligence (AI), when working alongside clinicians, has the potential to reduce the cognitive errors that lead to hundreds of thousands of misdiagnoses every year. However, human oversight of AI remains essential: LLMs cannot replace clinicians, especially when it comes to human understanding and empathy. Furthermore, significant challenges remain in incorporating AI into health care, including ethical, liability and regulatory barriers.
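The hit rate described in the abstract, the proportion of ground truth diagnoses that the model correctly predicts, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The function name `diagnostic_hit_rate`, the pooling of diagnoses across patients, the matching by case-insensitive exact string, and the toy records are all assumptions made for illustration; the abstract does not say how predicted and ground truth diagnoses were matched.

```python
def diagnostic_hit_rate(ground_truth, predictions):
    """Pooled share of ground truth diagnoses that also appear in the predictions.

    Both arguments map a patient ID to a list of diagnosis strings.
    Matching by normalised exact string is an assumption of this sketch.
    """
    hits = 0
    total = 0
    for patient_id, truth in ground_truth.items():
        truth_set = {d.strip().lower() for d in truth}
        predicted_set = {d.strip().lower() for d in predictions.get(patient_id, [])}
        hits += len(truth_set & predicted_set)  # ground truth diagnoses the model found
        total += len(truth_set)                 # all ground truth diagnoses for this patient
    if total == 0:
        raise ValueError("no ground truth diagnoses supplied")
    return hits / total


# Toy, made-up records (hypothetical, not taken from MIMIC-IV).
ground_truth = {
    "p1": ["Anaemia", "Hyponatraemia"],
    "p2": ["Acute kidney injury"],
}
predictions = {
    "p1": ["anaemia", "sepsis"],
    "p2": ["acute kidney injury"],
}
rate = diagnostic_hit_rate(ground_truth, predictions)
print(f"hit rate: {rate:.3f}")
```

Extra predictions that are not in the ground truth (such as "sepsis" above) do not lower this figure, which is why the quantity behaves like a sensitivity and not like an overall accuracy.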

Publisher

Wiley
