Guidelines For Rigorous Evaluation of Clinical LLMs For Conversational Reasoning

Published: 2023-09-12

Author:

Johri Shreya, Jeong Jaehwan, Tran Benjamin A., Schlessinger Daniel I., Wongvibulsin Shannon, Cai Zhuo Ran, Daneshjou Roxana, Rajpurkar Pranav

Abstract

The integration of Large Language Models (LLMs) like GPT-4 and GPT-3.5 into clinical diagnostics has the potential to transform patient-doctor interactions. However, the readiness of these models for real-world clinical application remains inadequately tested. This paper introduces the Conversational Reasoning Assessment Framework for Testing in Medicine (CRAFT-MD), a novel approach for evaluating clinical LLMs. Unlike traditional methods that rely on structured medical exams, CRAFT-MD focuses on natural dialogues, using simulated AI agents to interact with LLMs in a controlled, ethical environment. We applied CRAFT-MD to assess the diagnostic capabilities of GPT-4 and GPT-3.5 in the context of skin diseases. Our experiments revealed critical insights into the limitations of current LLMs in terms of clinical conversational reasoning, history taking, and diagnostic accuracy. Based on these findings, we propose a comprehensive set of guidelines for future evaluations of clinical LLMs. These guidelines emphasize realistic doctor-patient conversations, comprehensive history taking, open-ended questioning, and a combination of automated and expert evaluations. The introduction of CRAFT-MD marks a significant advancement in LLM testing, aiming to ensure that these models augment medical practice effectively and ethically.

Publisher

Cold Spring Harbor Laboratory

References: 37 articles.

1. Access to Care, Health Status, and Health Disparities in the United States and Canada: Results of a Cross-National Population-Based Survey

2. International variations in primary care physician consultation time: a systematic review of 67 countries

3. Dermatology consultations: how long do they take?

4. The State of Telehealth Before and After the COVID-19 Pandemic;Prim. Care,2022

5. Bubeck, S., et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. (2023).

Cited by 4 articles.

1. Beyond transparency and explainability: on the need for adequate and contextualized user guidelines for LLM use;Ethics and Information Technology;2024-07-17

2. Understanding natural language: Potential application of large language models to ophthalmology;Asia-Pacific Journal of Ophthalmology;2024-07

3. Large Language and Vision Assistant in dermatology: a game changer or just hype?;Clinical and Experimental Dermatology;2024-04-04

4. Systematic Review of Large Language Models for Patient Care: Current Applications and Challenges;2024-03-05