Cross-functional Analysis of Generalization in Behavioral Learning

Authors:

Pedro Henrique Luz de Araujo (1,2), Benjamin Roth (1,3)

Affiliation:

1. Faculty of Computer Science, University of Vienna, Vienna, Austria

2. UniVie Doctoral School Computer Science, Vienna, Austria. pedro.henrique.luz.de.araujo@univie.ac.at

3. Faculty of Philological and Cultural Studies, University of Vienna, Vienna, Austria. benjamin.roth@univie.ac.at

Abstract

In behavioral testing, system functionalities underrepresented in the standard evaluation setting (with a held-out test set) are validated through controlled input-output pairs. Optimizing performance on the behavioral tests during training (behavioral learning) would improve coverage of phenomena not sufficiently represented in the i.i.d. data and could lead to seemingly more robust models. However, there is the risk that the model narrowly captures spurious correlations from the behavioral test suite, leading to overestimation and misrepresentation of model performance—one of the original pitfalls of traditional evaluation. In this work, we introduce BeLUGA, an analysis method for evaluating behavioral learning considering generalization across dimensions of different granularity levels. We optimize behavior-specific loss functions and evaluate models on several partitions of the behavioral test suite controlled to leave out specific phenomena. An aggregate score measures generalization to unseen functionalities (or overfitting). We use BeLUGA to examine three representative NLP tasks (sentiment analysis, paraphrase identification, and reading comprehension) and compare the impact of a diverse set of regularization and domain generalization methods on generalization performance.
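The abstract outlines the core analysis loop: train with behavior-specific losses on part of the behavioral test suite and score the model on functionalities that were held out, then aggregate the held-out scores into a single generalization number. The following is a minimal Python sketch of such a leave-one-functionality-out loop; the Example dataclass, the train_fn callback, and the aggregate_score helper are hypothetical names for illustration only, not the actual BeLUGA interface.

# Minimal sketch of the leave-one-functionality-out evaluation described in the
# abstract. Example, train_fn, and aggregate_score are hypothetical names used
# only for illustration; they are not the actual BeLUGA interface.
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Example:
    text: str
    label: int
    functionality: str  # the behavioral phenomenon this test case probes


def leave_one_out_scores(
    suite: List[Example],
    train_fn: Callable[[List[Example]], Callable[[Example], int]],
) -> Dict[str, float]:
    """For each functionality, train with a behavior-specific loss on the rest
    of the suite and measure accuracy on the held-out functionality."""
    functionalities = sorted({ex.functionality for ex in suite})
    scores: Dict[str, float] = {}
    for held_out in functionalities:
        train = [ex for ex in suite if ex.functionality != held_out]
        test = [ex for ex in suite if ex.functionality == held_out]
        model = train_fn(train)  # behavioral learning on the seen functionalities
        scores[held_out] = mean(
            1.0 if model(ex) == ex.label else 0.0 for ex in test
        )
    return scores


def aggregate_score(scores: Dict[str, float]) -> float:
    """Single number summarizing generalization to unseen functionalities."""
    return mean(scores.values())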

Publisher

MIT Press

Subject

Artificial Intelligence, Computer Science Applications, Linguistics and Language, Human-Computer Interaction, Communication

