Large language models encode clinical knowledge

Published:2023-07-12 Issue:7972 Volume:620 Page:172-180
ISSN:0028-0836
Container-title:Nature
language:en
Short-container-title:Nature

Author:

Singhal Karan, Azizi Shekoofeh, Tu Tao, Mahdavi S. Sara, Wei Jason, Chung Hyung Won, Scales Nathan, Tanwani Ajay, Cole-Lewis Heather, Pfohl Stephen, Payne Perry, Seneviratne Martin, Gamble Paul, Kelly Chris, Babiker Abubakr, Schärli Nathanael, Chowdhery Aakanksha, Mansfield Philip, Demner-Fushman Dina, Agüera y Arcas Blaise, Webster Dale, Corrado Greg S., Matias Yossi, Chou Katherine, Gottweis Juraj, Tomasev Nenad, Liu Yun, Rajkomar Alvin, Barral Joelle, Semturs Christopher, Karthikesalingam Alan, Natarajan Vivek

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model [1] (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM [2], on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA [3], MedMCQA [4], PubMedQA [5] and Measuring Massive Multitask Language Understanding (MMLU) clinical topics [6]), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

Publisher

Springer Science and Business Media LLC

Subject

Multidisciplinary

Link

https://www.nature.com/articles/s41586-023-06291-2.pdf

References: 94 articles.

1. Chowdhery, A. et al. PaLM: scaling language modeling with pathways. Preprint at https://doi.org/10.48550/arXiv.2204.02311 (2022).

2. Chung, H. W. et al. Scaling instruction-finetuned language models. Preprint at https://doi.org/10.48550/arXiv.2210.11416 (2022).

3. Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. 11, 6421 (2021).

4. Pal, A., Umapathi, L. K. & Sankarasubbu, M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning 248–260 (Proceedings of Machine Learning Research, 2022).

5. Jin, Q., Dhingra, B., Liu, Z., Cohen, W. W. & Lu, X. PubMedQA: a dataset for biomedical research question answering. Preprint at https://doi.org/10.48550/arXiv.1909.06146 (2019).

Cited by 144 articles.

1. Potential merits and flaws of large language models in epilepsy care: A critical review;Epilepsia;2024-02-02

2. A Comparative Study of Responses to Retina Questions from either Experts, Expert-Edited Large Language Models (LLMs) or LLMs Alone;Ophthalmology Science;2024-02

3. Informed consent for artificial intelligence in emergency medicine: A practical guide;The American Journal of Emergency Medicine;2024-02

4. Large language models for biomolecular analysis: From methods to applications;TrAC Trends in Analytical Chemistry;2024-02

5. A systematic evaluation of the performance of GPT‐4 and PaLM2 to diagnose comorbidities in MIMIC‐IV patients;Health Care Science;2024-02