Affiliation:
1. West China Hospital of Sichuan University
2. West China School of Medicine
3. irst Affiliated Hospital, College of Medicine, Zhejiang University
4. Sports Medicine Center, West China Hospital, West Chian School of Medicine, Sichuan University, Chengdu, Sichuan, China
5. Sichuan University
Abstract
Abstract
Importance
This study adopted multi-agent framework in large language models to enhance diagnosis in complex medical cases, particularly rare diseases, revealing limitation in current training and benchmarking of LLMs in healthcare.
Objective
This study aimed to develop MAC LLMs for medical diagnosis, and compare the knowledge base and diagnostic capabilities of GPT-3.5, GPT-4, and MAC in the context of rare diseases.
Design, Setting and Participants
This study examined 150 rare diseases using clinical case reports published after January 1, 2022, from the Medline database. Each case was curated, and both the initial and complete presentations were extracted to simulate the different stages of patient consultation. A MAC framework was developed. Disease knowledge base was tested using GPT-3.5, GPT-4, and the MAC. Each case was subjected to the three models to generate one most likely diagnosis, several possible diagnoses, and further diagnostic tests. The results were presented for panel discussions with physicians. Disease knowledge was evaluated. The accuracy and scoring of the one most likely diagnosis, several possible diagnoses, and further diagnostic tests were also evaluated.
Main Outcomes And Measures:
Scoring of disease knowledge. Accuracy and scoring of the one most likely diagnosis, several possible diagnoses and further diagnostic tests.
Results
In terms of disease-specific knowledge, GPT-3.5, GPT-4, and MAC scored above 4.5 on average for each aspect. In terms of diagnostic ability, MAC outperformed GPT-3.5 and GPT-4 in initial presentations, achieving higher accuracy in the most likely diagnoses (28%), possible diagnoses (47.3%), and further diagnostic tests (83.3%). GPT-3.5 and GPT-4 exhibited lower accuracy in these areas. In complete presentations, MAC continued to demonstrate higher accuracies in the most likely diagnosis (48.0%) and possible diagnoses (66.7%) compared to GPT-3.5 and GPT-4. Diagnostic capability scoring also indicated higher performance for MAC.
Conclusion And Relevance
Despite the comprehensive knowledge base of GPT-3.5 and GPT-4, a noticeable gap exists in their clinical application for diagnosing rare diseases, underscoring the limitations in the current training and benchmarking methods of LLMs within the healthcare sector. Compared with single-agent models, the MAC framework markedly improves the diagnostic ability of LLMs, enabling more in-depth analysis. Therefore, the MAC framework is a promising tool for the diagnosis of rare diseases in clinical settings and warrants further research to fully explore its potential.
Publisher
Research Square Platform LLC