MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

Published:2024-03-24 Issue:20 Volume:38 Page:22021-22030
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title:AAAI

Author:

Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle A. Jindal, Eduardo Reis, Rahul Thapa, Louis Blankemeier, Julian Z. Genkins, Ethan Steinberg, Ashwin Nayak, Birju Patel, Chia-Chun Chiang, Alison Callahan, Zepeng Huo, Sergios Gatidis, Scott Adams, Oluseyi Fayanju, Shreya J. Shah, Thomas Savage, Ethan Goh, Akshay S. Chaudhari, Nima Aghaeepour, Christopher Sharp, Michael A. Pfeffer, Percy Liang, Jonathan H. Chen, Keith E. Morse, Emma P. Brunskill, Jason A. Fries, Nigam H. Shah

Abstract

The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general-domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and an 8.3% drop in accuracy moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. We make MedAlign available under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences.

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Cited by 5 articles.

1. A multi-scale embedding network for unified named entity recognition in Chinese Electronic Medical Records;Alexandria Engineering Journal;2024-11

2. The Most Disruptive Near-Term Use of AI in Cancer Care: Patient Empowerment Through Software Agents;AI in Precision Oncology;2024-08-30

3. Evaluating the clinical benefits of LLMs;Nature Medicine;2024-07-26

4. Application of Artificial Intelligence in the Headache Field;Current Pain and Headache Reports;2024-07-08

5. Unifying Corroborative and Contributive Attributions in Large Language Models;2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML);2024-04-09