Affiliation:
1. Information Sciences Institute, University of Southern California, Los Angeles, California, USA
Abstract
In the last 5 years, language representation models, such as BERT and GPT‐3, based on transformer neural networks, have led to enormous progress in natural language processing (NLP). One such NLP task is commonsense reasoning, where performance is usually evaluated through multiple‐choice question answering benchmarks. To date, many such benchmarks have been proposed, and ‘leaderboards’ tracking state‐of‐the‐art performance on those benchmarks suggest that transformer‐based models are approaching human‐like performance. Because these are commonsense benchmarks, however, such a model should be expected to generalize; that is, at least in aggregate, it should not exhibit excessive performance loss across independent commonsense benchmarks, regardless of the specific benchmark on (the training set of) which it has been fine‐tuned. In this article, we evaluate this expectation by proposing a methodology and experimental study to measure the generalization ability of language representation models using a rigorous and intuitive metric. Using five established commonsense reasoning benchmarks, our experimental study shows that the models do not generalize well and may be susceptible to issues such as dataset bias. The results therefore suggest that current performance on benchmarks may be an over‐estimate, especially if we want to apply such models to novel commonsense problems for which a ‘training’ dataset is not available for the language representation model to fine‐tune on.
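To make the evaluation setup concrete, the sketch below shows one plausible way to summarize cross-benchmark generalization: a model is fine‐tuned on the training split of one benchmark, evaluated on the test splits of all benchmarks, and the average relative accuracy drop on the benchmarks it was not fine‐tuned on is reported. This is an illustrative instantiation only; the metric, benchmark names, and numbers here are placeholders and are not necessarily the exact metric or results defined in the article.

```python
def generalization_drop(acc: dict[str, dict[str, float]]) -> float:
    """Mean relative out-of-benchmark accuracy drop.

    acc[src][tgt] = accuracy of the model fine-tuned on benchmark `src`
    and evaluated on benchmark `tgt`. The in-benchmark accuracy acc[src][src]
    serves as the reference point for each fine-tuned model.
    """
    drops = []
    for src, row in acc.items():
        in_benchmark = row[src]
        for tgt, a in row.items():
            if tgt != src:
                drops.append((in_benchmark - a) / in_benchmark)
    return sum(drops) / len(drops)


if __name__ == "__main__":
    # Placeholder accuracies for illustration only (not results from the article).
    acc = {
        "bench_A": {"bench_A": 0.80, "bench_B": 0.55, "bench_C": 0.50},
        "bench_B": {"bench_A": 0.52, "bench_B": 0.78, "bench_C": 0.49},
        "bench_C": {"bench_A": 0.50, "bench_B": 0.54, "bench_C": 0.75},
    }
    print(f"Mean relative out-of-benchmark accuracy drop: {generalization_drop(acc):.2%}")
```

A large average drop under such a measure would indicate that fine‐tuned performance on any single benchmark overstates how well the model handles commonsense problems it was not trained on.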
Funder
Defense Advanced Research Projects Agency
Subject
Artificial Intelligence, Computational Theory and Mathematics, Theoretical Computer Science, Control and Systems Engineering
Cited by
8 articles.