Evaluating and Enhancing Large Language Models’ Performance in Domain-Specific Medicine: Development and Usability Study With DocOA

Published: 2024-07-22
Volume: 26
Page: e58158
ISSN: 1438-8871
Container-title: Journal of Medical Internet Research
Language: en
Short-container-title: J Med Internet Res

Author:

Chen Xi, Wang Li, You MingKe, Liu WeiZhi, Fu Yu, Xu Jie, Zhang Shaoting, Chen Gang, Li Kang, Li Jian

Abstract

Background: The efficacy of large language models (LLMs) in domain-specific medicine, particularly for managing complex diseases such as osteoarthritis (OA), remains largely unexplored.

Objective: This study focused on evaluating and enhancing the clinical capabilities and explainability of LLMs in specific domains, using OA management as a case study.

Methods: A domain-specific benchmark framework was developed to evaluate LLMs across a spectrum from domain-specific knowledge to clinical applications in real-world clinical scenarios. DocOA, a specialized LLM designed for OA management integrating retrieval-augmented generation and instructional prompts, was developed. It can identify the clinical evidence upon which its answers are based through retrieval-augmented generation, thereby demonstrating the explainability of those answers. The study compared the performance of GPT-3.5, GPT-4, and a specialized assistant, DocOA, using objective and human evaluations.

Results: General LLMs such as GPT-3.5 and GPT-4 were less effective in the specialized domain of OA management, particularly in providing personalized treatment recommendations. However, DocOA showed significant improvements.

Conclusions: This study introduces a novel benchmark framework that assesses the domain-specific abilities of LLMs in multiple aspects, highlights the limitations of generalized LLMs in clinical contexts, and demonstrates the potential of tailored approaches for developing domain-specific medical LLMs.

Publisher

JMIR Publications Inc.
