A noise audit of human-labeled benchmarks for machine commonsense reasoning

Authors:

Mayank Kejriwal, Henrique Santos, Ke Shen, Alice M. Mulvehill, Deborah L. McGuinness

Abstract

With the advent of large language models, evaluating and benchmarking these systems on important AI problems has taken on newfound importance. Such benchmarking typically involves comparing a system’s predictions against human labels (or a single ‘ground truth’). However, much recent work in psychology suggests that most tasks involving significant human judgment carry non-trivial degrees of noise. In his book Noise, Kahneman argues that noise may be a much more significant component of inaccuracy than bias, which has been studied far more extensively in the AI community. This article proposes a detailed noise audit of human-labeled benchmarks in machine commonsense reasoning, an important current area of AI research. We conduct noise audits under two experimental conditions: a smaller-scale but higher-quality labeling setting, and a larger-scale, more realistic online crowdsourced setting. Using Kahneman’s framework of noise, our results consistently show non-trivial amounts of level, pattern, and system noise, even in the higher-quality setting, with comparable results in the crowdsourced setting. We find that noise can significantly influence the performance estimates obtained for commonsense reasoning systems, even when the ‘system’ is a human; in some cases, by almost 10 percent. Labeling noise also shifts performance estimates of systems like ChatGPT by more than 4 percent. Our results suggest that the AI community’s default practice of assuming and using a single ‘ground truth’, even on problems requiring seemingly straightforward human judgment, may warrant empirical and methodological revisiting.
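
Kahneman’s framework decomposes the variability of human judgments into level noise (annotators who are systematically more lenient or severe than others), pattern noise (annotator-by-item interactions), and system noise (their combination). As a minimal sketch of how these quantities can be computed from a complete matrix of annotator ratings (the ratings array, its values, and all variable names below are illustrative assumptions, not the paper’s data or exact procedure):

import numpy as np

# Hypothetical ratings: rows = annotators (judges), columns = benchmark items (cases).
ratings = np.array([
    [3.0, 4.0, 2.0, 5.0],
    [4.0, 4.0, 3.0, 5.0],
    [2.0, 3.0, 2.0, 4.0],
])

judge_means = ratings.mean(axis=1)   # each annotator's average severity
case_means = ratings.mean(axis=0)    # each item's average rating
grand_mean = ratings.mean()

# Level noise: variance of the annotator means (differences in overall severity).
level_noise = np.var(judge_means)

# System noise: variance of judgments across annotators, averaged over items.
system_noise = np.var(ratings, axis=0).mean()

# Pattern noise: residual annotator-by-item variability after removing each
# annotator's overall level and each item's average rating.
residual = ratings - judge_means[:, None] - case_means[None, :] + grand_mean
pattern_noise = np.var(residual)

# With population variances (np.var, ddof=0), the decomposition is exact:
# system_noise == level_noise + pattern_noise.
print(f"level={level_noise:.3f}  pattern={pattern_noise:.3f}  system={system_noise:.3f}")

On this toy matrix the identity holds exactly (0.264 + 0.069 = 0.333) because population variances are used throughout; Kahneman’s mean-squared-error formulation states the same decomposition in squared (variance) terms.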

Funder

Defense Advanced Research Projects Agency

Publisher

Springer Science and Business Media LLC

Cited by

2 articles.
