How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering

Author:

Zhengbao Jiang1, Jun Araki2, Haibo Ding3, Graham Neubig4

Affiliation:

1. Language Technologies Institute, Carnegie Mellon University, United States. zhengbaj@cs.cmu.edu

2. Bosch Research, United States. jun.araki@us.bosch.com

3. Bosch Research, United States. haibo.ding@us.bosch.com

4. Language Technologies Institute, Carnegie Mellon University, United States. gneubig@cs.cmu.edu

Abstract

Recent works have shown that language models (LMs) capture different types of knowledge regarding facts or common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question, “How can we know when language models know, with confidence, the answer to a particular query?” We examine this question from the point of view of calibration, the property of a probabilistic model’s predicted probabilities actually being well correlated with the probabilities of correctness. We examine three strong generative models—T5, BART, and GPT-2—and study whether their probabilities on QA tasks are well calibrated, finding the answer is a relatively emphatic no. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness through fine-tuning, post-hoc probability modification, or adjustment of the predicted outputs or inputs. Experiments on a diverse range of datasets demonstrate the effectiveness of our methods. We also perform analysis to study the strengths and limitations of these methods, shedding light on further improvements that may be made in methods for calibrating LMs. We have released the code at https://github.com/jzbjyb/lm-calibration.
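
The abstract's central notion, calibration, and one family of fixes it mentions, post-hoc probability modification, can both be made concrete. Below is a minimal Python sketch (not the paper's released code; see the repository above for that) of two standard building blocks: expected calibration error (ECE), which measures the gap between a model's confidence and its accuracy, and temperature scaling, a common post-hoc adjustment. All function names and toy data here are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: measuring calibration (ECE) and post-hoc probability
# modification (temperature scaling). Illustrative only, not the paper's code.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the per-bin gap
    between mean confidence and empirical accuracy, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

def temperature_scale(logits, temperature):
    """Soften (T > 1) or sharpen (T < 1) a logit vector, then re-normalize.
    T is typically tuned on held-out data; T > 1 tempers overconfidence."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Toy usage: an overconfident model whose confidences exceed its accuracy.
conf = [0.95, 0.90, 0.85, 0.80, 0.99]  # model confidence in its answers
hit  = [1,    0,    1,    0,    1   ]  # whether each answer was correct
print(expected_calibration_error(conf, hit))          # large value -> poorly calibrated
print(temperature_scale([3.2, 1.1, 0.3], temperature=2.0))  # flatter distribution
```

A well-calibrated model would drive ECE toward zero: among answers given with, say, 80% confidence, roughly 80% would be correct.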

Publisher

MIT Press - Journals


Cited by 39 articles.

1. From text to multimodal: a survey of adversarial example generation in question answering systems;Knowledge and Information Systems;2024-08-09

2. LM-PACE: Confidence Estimation by Large Language Models for Effective Root Causing of Cloud Incidents;Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering;2024-07-10

3. Steering Large Language Models for Cross-lingual Information Retrieval;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10

4. Extraction of Subjective Information from Large Language Models;2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC);2024-07-02

5. Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports;Radiology: Artificial Intelligence;2024-07-01
