Assessing DxGPT: Diagnosing Rare Diseases with Various Large Language Models

Published: 2024-05-09

Author:

do Olmo Juanjo, Logroño Javier, Mascías Carlos, Martínez Marcelo, Isla Julián

Abstract

Diagnosing rare diseases is a significant challenge in healthcare, with patients often experiencing long delays and misdiagnoses. The large number of rare diseases and the difficulty for doctors to be familiar with all of them contribute to this problem. Artificial intelligence, particularly large language models (LLMs), has shown promise in improving the diagnostic process by leveraging their extensive knowledge to help doctors navigate the complexities of diagnosing rare diseases.

Foundation 29 presents a comprehensive evaluation of DxGPT, a web-based platform designed to assist healthcare professionals in the diagnostic process for rare diseases. The platform currently utilizes GPT-4, but this study also compares its performance with other large language models, including Claude 3, Gemini 1.5 Pro, Llama, Mistral, Mixtral, and Cohere Command R+. It is crucial to emphasize that DxGPT is not a medical device but rather a decision support tool that aims to aid in clinical reasoning.

This study extends beyond initial synthetic patient cases, incorporating real-world data from the RAMEDIS and Peking Union Medical College Hospital (PUMCH) datasets. The analysis followed two main metrics: Strict Accuracy (P1), how often the first diagnostic suggestion agreed with the real diagnosis, and Top-5 Accuracy (P1 + P5), how often the right diagnosis was in the top five suggestions. The results show a complex picture of diagnostic accuracy, with performance varying significantly across models and datasets:

On the synthetic dataset, closed models like GPT-4, Claude, and Gemini exhibited relatively high accuracy. Open models like Llama 3 and Mixtral performed reasonably well, though lagging behind.

On the RAMEDIS rare disease cases, the Claude 3 Opus model demonstrated 55% Strict Accuracy and 70% Top-5 Accuracy, outperforming other closed models. Open models like Llama 3 and Mixtral showed moderate accuracy.

The PUMCH dataset proved challenging for all models, with the highest Strict Accuracy at 59.46% (GPT-4 Turbo 1106) and Top-5 Accuracy at 64.86%.

These findings demonstrate the potential of DxGPT and LLMs in improving diagnostic methods for rare diseases. However, they also emphasize the need for further validation, particularly in real-world clinical settings, and comparison with human expert diagnoses. Successful integration of AI into medical diagnostics will require collaboration between researchers, clinicians, and regulatory bodies to ensure safety, efficacy, and ethical deployment.
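
The two metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes each case is a ranked list of suggested diagnoses paired with the confirmed diagnosis, and that a suggestion counts as correct when it matches the confirmed diagnosis exactly after case and whitespace normalization. That matching rule, the function names, and the sample cases are hypothetical choices for illustration, not the study's actual scoring procedure.

```python
from typing import Optional, Sequence, Tuple


def rank_of_truth(predictions: Sequence[str], truth: str) -> Optional[int]:
    """Return the 1-based rank of the true diagnosis, or None if it is absent.

    Assumption: exact match after lowercasing and trimming whitespace.
    """
    target = truth.strip().lower()
    for rank, candidate in enumerate(predictions, start=1):
        if candidate.strip().lower() == target:
            return rank
    return None


def strict_and_top5_accuracy(
    cases: Sequence[Tuple[Sequence[str], str]],
) -> Tuple[float, float]:
    """Compute (Strict Accuracy, Top-5 Accuracy) over (predictions, truth) pairs."""
    if not cases:
        raise ValueError("at least one case is required")
    # Only the first five suggestions are considered for each case.
    ranks = [rank_of_truth(list(preds)[:5], truth) for preds, truth in cases]
    n = len(ranks)
    strict = sum(1 for r in ranks if r == 1) / n         # truth ranked first
    top5 = sum(1 for r in ranks if r is not None) / n    # truth anywhere in top five
    return strict, top5


# Hypothetical cases, invented purely to exercise the functions.
example_cases = [
    (["Fabry disease", "Gaucher disease"], "Fabry disease"),              # rank 1
    (["Pompe disease", "Danon disease", "Fabry disease"], "Fabry disease"),  # rank 3
    (["Marfan syndrome"], "Loeys-Dietz syndrome"),                        # missed
    (["Wilson disease"], "wilson disease"),                               # rank 1
]
strict, top5 = strict_and_top5_accuracy(example_cases)
```

In this sketch Top-5 Accuracy includes first-rank hits, which mirrors the abstract's "P1 + P5" wording. Real diagnosis names rarely match as exact strings, so a practical evaluation would need a more tolerant comparison, such as synonym mapping or expert review.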

Publisher

Cold Spring Harbor Laboratory

References (27 articles)

1. Diagnostic delay in rare diseases: data from the Spanish rare diseases patient registry

2. A guide for the diagnosis of rare and undiagnosed disease: beyond the exome

3. Can a decision support system accelerate rare disease diagnosis? Evaluating the potential impact of Ada DX in a retrospective study

4. How many rare diseases are there?

5. The unsolved rare genetic disease atlas? An analysis of the unexplained phenotypic descriptions in OMIM®

Cited by 2 articles.

1. Diagnostic Accuracy of a Custom Large Language Model on Rare Pediatric Disease Case Reports; American Journal of Medical Genetics Part A; 2024-09-13

2. Evaluation of the Clinical Utility of DxGPT, a GPT-4 Based Large Language Model, through an Analysis of Diagnostic Accuracy and User Experience; 2024-07-26