Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering

Authors:

Adlakha, Vaibhav (1,2); BehnamGhader, Parishad (3); Lu, Xing Han (4); Meade, Nicholas (5); Reddy, Siva (6,2,7)

Affiliation:

1. Mila, McGill University, Canada. vaibhav.adlakha@mila.quebec

2. ServiceNow Research, Canada

3. Mila, McGill University, Canada. parishad.behnamghader@mila.quebec

4. Mila, McGill University, Canada. xing-han.lu@mila.quebec

5. Mila, McGill University, Canada. nicholas.meade@mila.quebec

6. Mila, McGill University, Canada. siva.reddy@mila.quebec

7. Facebook CIFAR AI Chair, Canada

Abstract

Instruction-following models are attractive alternatives to fine-tuned approaches for question answering (QA). By simply prepending relevant documents and an instruction to their input, these models can be adapted to various information domains and tasks without additional training. However, these models tend to produce verbose responses with supplementary information, which makes traditional QA metrics like exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we evaluate instruction-following models along two fronts: 1) how well they satisfy the user's information need (correctness), and 2) whether they disseminate information supported by the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness and propose simple token-overlap metrics that correlate highly with human judgments. Our analysis reveals that for correctness, instruction-following models perform comparably to models specifically fine-tuned for that task. However, they struggle to accurately judge the relevance of the provided knowledge and often hallucinate in their responses. We hope our work encourages more holistic evaluation of instruction-following models for QA. Our code and human annotation data are available at https://github.com/McGill-NLP/instruct-qa.
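The abstract contrasts exact match and F1 with simpler token-overlap metrics that tolerate verbose responses. As a rough illustration of that idea only, the sketch below computes (i) recall of reference-answer tokens in a model response as a correctness proxy and (ii) precision of response tokens against the provided knowledge as a faithfulness proxy. The function names, normalization scheme, and exact definitions are assumptions made for illustration; they are not taken from the paper or the instruct-qa repository.

```python
# Minimal sketch of token-overlap metrics of the kind the abstract describes.
# Illustrative only; not the exact formulas used in the instruct-qa codebase.
import re
import string


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def answer_recall(response: str, reference: str) -> float:
    """Correctness proxy: fraction of reference-answer tokens found in the response.

    Unlike exact match, a verbose but correct response is not penalized."""
    ref = normalize(reference)
    resp = set(normalize(response))
    return sum(t in resp for t in ref) / len(ref) if ref else 0.0


def knowledge_precision(response: str, knowledge: str) -> float:
    """Faithfulness proxy: fraction of response tokens grounded in the provided knowledge."""
    resp = normalize(response)
    know = set(normalize(knowledge))
    return sum(t in know for t in resp) / len(resp) if resp else 0.0


if __name__ == "__main__":
    knowledge = "Paris is the capital and most populous city of France."
    response = "The capital of France is Paris, its most populous city."
    print(answer_recall(response, "Paris"))          # 1.0, whereas EM would score 0
    print(knowledge_precision(response, knowledge))  # ~0.89; only "its" is ungrounded
```

The design choice mirrors the abstract's argument: recall over the reference answer ignores extra wording, so it does not penalize the verbose responses that make EM unreliable, while precision against the provided passage drops when the response introduces unsupported content.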

Publisher

MIT Press

References (55 articles; first 5 shown).

1. Adlakha (2022). TopiOCQA: Open-domain conversational question answering with topic switching. Transactions of the Association for Computational Linguistics.

2. Banerjee (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments.

3. Brown (2020). Language models are few-shot learners.

4. Bulian (2022). Tomayto, tomahto. Beyond token-level answer equivalence for question answering evaluation.

5. Chiang (2023). Can large language models be an alternative to human evaluations?

Cited by 6 articles.

1. On Early Detection of Hallucinations in Factual Question Answering. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024-08-24.

2. Towards trustworthy LLMs: a review on debiasing and dehallucinating in large language models. Artificial Intelligence Review, 2024-08-10.

3. Comprehending and Reducing LLM Hallucinations. International Journal of Innovative Science and Research Technology (IJISRT), 2024-07-31.

4. Performance Analysis of Llama 2 Among Other LLMs. 2024 IEEE Conference on Artificial Intelligence (CAI), 2024-06-25.

5. Design of an Autonomous Cyber Defence Agent using Hybrid AI models. 2024 International Conference on Military Communication and Information Systems (ICMCIS), 2024-04-23.
