Quantifying confidence shifts in a BERT-based question answering system evaluated on perturbed instances

Published:2023-12-20 Issue:12 Volume:18 Page:e0295925
ISSN:1932-6203
Container-title:PLOS ONE
language:en
Short-container-title:PLoS ONE

Author:

Shen Ke, Kejriwal Mayank

Abstract

Recent work on transformer-based neural networks has led to impressive advances on multiple-choice natural language processing (NLP) problems, such as Question Answering (QA) and abductive reasoning. Despite these advances, there is still limited work on systematically evaluating such models in ambiguous situations where (for example) no correct answer exists for a given prompt among the provided set of choices. Such ambiguous situations are not infrequent in real-world applications. We design and conduct an experimental study of this phenomenon using three probes that aim to ‘confuse’ the model by perturbing QA instances in a consistent and well-defined manner. Using a detailed set of results based on an established transformer-based multiple-choice QA system on two established benchmark datasets, we show that the model’s confidence in its results is very different from that of an expected model that is ‘agnostic’ to all choices that are incorrect. Our results suggest that high performance on idealized QA instances should not be used to infer or extrapolate similarly high performance on more ambiguous instances. Auxiliary results suggest that the model may not be able to distinguish between these two situations with sufficient certainty. Stronger testing protocols and benchmarking may hence be necessary before such models are deployed in front-facing systems or ambiguous decision-making with significant human impact.

Funder

Defense Sciences Office, DARPA

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary

References (66 articles)

1. Hirschman L. Natural language question answering: the view from here. Natural Language Engineering. 2001.

2. Siblini W, Pasqual C, Lavielle A, Cauchois C. Multilingual question answering from formatted text applied to conversational agents. arXiv preprint arXiv:1910.04659. 2019.

3. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.

4. Floridi L. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines. 2020.

5. Lee JS, Hsiang J. PatentBERT: Patent classification with fine-tuning a pre-trained BERT model. arXiv preprint arXiv:1906.02124. 2019.