Evaluating Large Language Models on Medical Evidence Summarization

Published: 2023-04-24

Author:

Tang Liyan, Sun Zhaoyi, Idnay Betina, Nestor Jordan G, Soroush Ali, Elias Pierre A., Xu Ziyang, Ding Ying, Durrett Greg, Rousseau Justin, Weng Chunhua, Peng Yifan

Abstract

Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study has demonstrated that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs could be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts.

Publisher

Cold Spring Harbor Laboratory

References: 24 articles.

1. PaLM: Scaling language modeling with pathways;arXiv preprint,2022

2. Chain-of-thought prompting elicits reasoning in large language models;Advances in Neural Information Processing Systems,2022

3. Large language models are Zero-Shot reasoners;Advances in Neural Information Processing Systems,2022

Cited by 10 articles.

1. The ethics of ChatGPT in medicine and healthcare: a systematic review on Large Language Models (LLMs);npj Digital Medicine;2024-07-08

2. Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain;Systematic Reviews;2024-06-15

3. Can Large Language Models Provide Emergency Medical Help Where There Is No Ambulance? A Comparative Study on Large Language Model Understanding of Emergency Medical Scenarios in Resource-Constrained Settings;2024-04-19

4. Medical Reports Simplification Using Large Language Models;Lecture Notes in Networks and Systems;2024

5. Effective Natural Language Processing Algorithms for Gout Flare Early Alert from Chief Complaints;2023-11-29