Large Language Model Uncertainty Measurement and Calibration for Medical Diagnosis and Treatment

Published: 2024-06-07

Author:

Savage Thomas, Wang John, Gallo Robert, Boukil Abdessalem, Patel Vishwesh, Ahmad Safavi-Naini Seyed Amir, Soroush Ali, Chen Jonathan H

Abstract

Introduction: The inability of Large Language Models (LLMs) to communicate uncertainty is a significant barrier to their use in medicine. Before LLMs can be integrated into patient care, the field must assess methods to measure uncertainty in ways that are useful to physician-users.

Objective: Evaluate the ability of uncertainty metrics to quantify LLM confidence when performing diagnosis and treatment selection tasks by assessing the properties of discrimination and calibration.

Methods: We examined Confidence Elicitation, Token Level Probability, and Sample Consistency metrics across GPT3.5, GPT4, Llama2 and Llama3. Uncertainty metrics were evaluated against three datasets of open-ended patient scenarios.

Results: Sample Consistency methods outperformed Token Level Probability and Confidence Elicitation methods. Sample Consistency by Sentence Embedding achieved the highest discrimination performance (ROC AUC 0.68–0.79) with poor calibration, while Sample Consistency by GPT Annotation achieved the second-best discrimination (ROC AUC 0.66–0.74) with more accurate calibration. Nearly all uncertainty metrics had better discriminative performance with diagnosis questions than with treatment selection questions. Furthermore, verbalized confidence (Confidence Elicitation) was found to consistently over-estimate model confidence.

Conclusions: Of the metrics evaluated, Sample Consistency is the most effective method for estimating LLM uncertainty. Sample Consistency by Sentence Embedding can effectively estimate uncertainty if the user has a set of reference cases with which to re-calibrate their results, while Sample Consistency by GPT Annotation is the more effective method if the user does not have reference cases and requires accurate raw calibration. Our results confirm that LLMs are consistently over-confident when verbalizing their confidence through Confidence Elicitation.
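The abstract separates two properties of an uncertainty metric: discrimination (do correct answers receive higher confidence than incorrect ones?) and calibration (does a stated confidence of 0.8 correspond to roughly 80% accuracy?). The following is a minimal sketch of how the two can diverge. It is not the authors' evaluation code; the function names, the choice of expected calibration error with equal-width bins, and the toy scores are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Discrimination: probability that a correct answer (label 1) receives a
    higher confidence score than an incorrect one (label 0); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def expected_calibration_error(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 5
) -> float:
    """Calibration: size-weighted mean gap between average confidence and
    observed accuracy across equal-width confidence bins on [0, 1]."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    total = len(scores)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(s for s, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical toy data: 1 = answer judged correct, 0 = judged incorrect.
labels = [1, 1, 0, 1, 0, 0]
# Scores that rank answers perfectly but are all high (over-confident).
raw_scores = [0.95, 0.90, 0.85, 0.92, 0.80, 0.82]
# The same ranking after a hypothetical re-calibration against reference cases.
recalibrated_scores = [0.9, 0.7, 0.3, 0.8, 0.1, 0.2]

if __name__ == "__main__":
    for name, scores in [("raw", raw_scores), ("recalibrated", recalibrated_scores)]:
        print(name, roc_auc(scores, labels), expected_calibration_error(scores, labels))
```

In this toy example both score lists produce the same ROC AUC, because the ranking of answers is unchanged, while the re-calibrated list has the lower calibration error. This mirrors the abstract's point that a metric with strong discrimination but poor raw calibration can still be useful when reference cases are available for re-calibration.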

Publisher

Cold Spring Harbor Laboratory
