Evaluating ChatGPT-4’s Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases (Preprint)

Published: 2024-04-08

Author:

Hirosawa Takanobu, Harada Yukinori, Mizuta Kazuya, Sakamoto Tetsu, Tokumasu Kazuki, Shimizu Taro

Abstract

BACKGROUND

The potential of artificial intelligence (AI) chatbots, particularly ChatGPT with GPT-4 (OpenAI), in assisting with medical diagnosis is an emerging research area. However, it is not yet clear how well AI chatbots can evaluate whether the final diagnosis is included in differential diagnosis lists.

OBJECTIVE

This study aims to assess the capability of GPT-4 in identifying the final diagnosis from differential-diagnosis lists and to compare its performance with that of physicians in a case report series.

METHODS

We used a database of differential-diagnosis lists from case reports in the American Journal of Case Reports, together with the corresponding final diagnoses. These lists were generated by 3 AI systems: GPT-4, Google Bard (currently Google Gemini), and Large Language Models by Meta AI 2 (LLaMA2). The primary outcome was whether GPT-4’s evaluations identified the final diagnosis within these lists. None of these AI systems received additional medical training or reinforcement. For comparison, 2 independent physicians also evaluated the lists, with any inconsistencies resolved by another physician.

RESULTS

The 3 AI systems generated a total of 1176 differential-diagnosis lists from 392 case descriptions. GPT-4’s evaluations concurred with those of the physicians in 966 out of 1176 lists (82.1%). The Cohen κ coefficient was 0.63 (95% CI 0.56-0.69), indicating fair to good agreement between GPT-4’s and the physicians’ evaluations.
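The agreement figures above can be checked with a short Python sketch. It recomputes the raw agreement from the reported counts and shows how Cohen κ corrects for chance agreement, κ = (p_o − p_e) / (1 − p_e). The function names are illustrative, and the 2×2 table at the end is a hypothetical toy example, not data from the study, because the abstract does not report the underlying contingency table.

```python
def percent_agreement(agree: int, total: int) -> float:
    """Observed agreement p_o: share of lists on which both raters agree."""
    return agree / total


def cohen_kappa(table: list[list[int]]) -> float:
    """Cohen kappa from a square table: table[i][j] counts items that
    rater A put in category i and rater B put in category j."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_o = sum(table[i][i] for i in range(k)) / n
    row_marg = [sum(table[i]) / n for i in range(k)]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected by chance if the two raters were independent
    p_e = sum(row_marg[i] * col_marg[i] for i in range(k))
    return (p_o - p_e) / (1 - p_e)


def implied_chance_agreement(p_o: float, kappa: float) -> float:
    """Solve kappa = (p_o - p_e) / (1 - p_e) for p_e."""
    return (p_o - kappa) / (1 - kappa)


# Counts reported in the abstract: 966 concordant evaluations out of 1176 lists
p_o = percent_agreement(966, 1176)
print(f"{p_o:.1%}")  # 82.1%

# Chance agreement implied by the reported kappa of 0.63
print(round(implied_chance_agreement(p_o, 0.63), 2))  # 0.52

# Hypothetical toy table (NOT study data), used only to exercise cohen_kappa
toy_table = [[40, 10], [5, 45]]
print(round(cohen_kappa(toy_table), 2))  # 0.7
```

Under these reported values, roughly half of the observed agreement would be expected by chance alone, which is why κ (0.63) is noticeably lower than the raw 82.1% agreement.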

CONCLUSIONS

GPT-4 demonstrated fair to good agreement with physicians in identifying the final diagnosis from differential-diagnosis lists for a case report series. Its ability to compare differential-diagnosis lists with final diagnoses suggests its potential to support clinical decision-making through diagnostic feedback. Although GPT-4 showed fair to good agreement in this evaluation, its application in real-world scenarios and further validation in diverse clinical environments are essential to fully understand its utility in the diagnostic process.

Publisher

JMIR Publications Inc.
