Holistic Evaluation of Language Models

Published: 2023-05-25 Issue: 1 Volume: 1525 Page: 140-146
ISSN: 0077-8923
Container-title: Annals of the New York Academy of Sciences
Language: en
Short-container-title: Annals of the New York Academy of Sciences

Author:

Bommasani Rishi¹, Liang Percy¹, Lee Tony¹

Affiliation:

1. Center for Research on Foundation Models, Stanford University, Stanford, California, USA

Abstract

Language models (LMs) like GPT‐3, PaLM, and ChatGPT are the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of LMs. LMs can serve many purposes and their behavior should satisfy many desiderata. To navigate the vast space of potential scenarios and metrics, we taxonomize the space and select representative subsets. We evaluate models on 16 core scenarios and 7 metrics, exposing important trade‐offs. We supplement our core evaluation with seven targeted evaluations to deeply analyze specific aspects (including world knowledge, reasoning, regurgitation of copyrighted content, and generation of disinformation). We benchmark 30 LMs, from OpenAI, Microsoft, Google, Meta, Cohere, AI21 Labs, and others. Prior to HELM, models were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: all 30 models are now benchmarked under the same standardized conditions. Our evaluation surfaces 25 top‐level findings. For full transparency, we release all raw model prompts and completions publicly. HELM is a living benchmark for the community, continuously updated with new scenarios, metrics, and models: https://crfm.stanford.edu/helm/latest/.

Publisher

Wiley

Subject

History and Philosophy of Science; General Biochemistry, Genetics and Molecular Biology; General Neuroscience

Link

https://nyaspubs.onlinelibrary.wiley.com/doi/pdf/10.1111/nyas.15007

Reference: 65 articles.

1. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J. Q., Demszky, D., … Liang, P. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.

2. Ethayarajh, K., & Jurafsky, D. (2020). Utility is in the eye of the user: A critique of NLP leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. 4846–4853.

3. Birhane, A., Kalluri, P., Card, D., Agnew, W., Dotan, R., & Bao, M. (2022). The values encoded in machine learning research. In 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22). New York: Association for Computing Machinery.

4. Some Points in a Time

Cited by 30 articles.

1. Viability of Open Large Language Models for Clinical Documentation in German Health Care: Real-World Model Evaluation Study;JMIR Medical Informatics;2024-08-28

2. Towards trustworthy LLMs: a review on debiasing and dehallucinating in large language models;Artificial Intelligence Review;2024-08-10

3. Comparative Analysis of AI Systems and Human Nutrition Knowledge: Evaluating ChatGPT and Other AI Systems Against Dietetics Students and the General Population (Preprint);2024-07-26

4. Generative artificial intelligence, patient safety and healthcare quality: a review;BMJ Quality & Safety;2024-07-24

5. The life cycle of large language models in education: A framework for understanding sources of bias;British Journal of Educational Technology;2024-07-12