Abstract
Recent work on transformer-based neural networks has led to impressive advances on multiple-choice natural language processing (NLP) problems, such as Question Answering (QA) and abductive reasoning. Despite these advances, there is still limited work on systematically evaluating such models in ambiguous situations where (for example) no correct answer exists for a given prompt among the provided set of choices. Such ambiguous situations are not infrequent in real-world applications. We design and conduct an experimental study of this phenomenon using three probes that aim to ‘confuse’ the model by perturbing QA instances in a consistent and well-defined manner. Using a detailed set of results based on an established transformer-based multiple-choice QA system on two established benchmark datasets, we show that the model’s confidence in its answers differs markedly from that of an expected model that is ‘agnostic’ to all incorrect choices. Our results suggest that high performance on idealized QA instances should not be used to infer or extrapolate similarly high performance on more ambiguous instances. Auxiliary results suggest that the model may not be able to distinguish between these two situations with sufficient certainty. Stronger testing protocols and benchmarking may hence be necessary before such models are deployed in front-facing systems or in ambiguous decision-making settings with significant human impact.
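The abstract does not spell out how the probes are implemented, but the idea of perturbing a QA instance so that no correct answer remains can be illustrated with a minimal sketch. The sketch below is a hypothetical Python rendering of one such ‘no correct answer’ perturbation; the QAInstance type and the function name are illustrative assumptions, not the authors’ actual code or API.

```python
# Hypothetical sketch of a 'no correct answer' probe: replace the gold
# answer in a multiple-choice QA instance with an unrelated distractor,
# so that none of the remaining choices is correct. (Illustrative only;
# QAInstance and remove_correct_answer are assumed names, not from the paper.)
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class QAInstance:
    question: str
    choices: List[str]
    answer_idx: int  # index of the gold answer in `choices`


def remove_correct_answer(inst: QAInstance, distractor: str) -> QAInstance:
    """Return a perturbed copy of `inst` in which no provided choice is correct."""
    new_choices = list(inst.choices)
    new_choices[inst.answer_idx] = distractor  # overwrite the gold answer
    # The answer index is now meaningless: a model that is 'agnostic' to
    # incorrect choices should spread its confidence roughly uniformly.
    return replace(inst, choices=new_choices)


# Example usage (hypothetical data):
inst = QAInstance(
    question="What is the capital of France?",
    choices=["Paris", "Rome", "Berlin", "Madrid"],
    answer_idx=0,
)
perturbed = remove_correct_answer(inst, distractor="Tokyo")
```

Under such a perturbation, the natural baseline for comparison is a choice-agnostic model whose confidence is uniform over the provided choices; the model’s actual maximum softmax probability (or the entropy of its choice distribution) on the perturbed instance can then be compared against that baseline.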
Funder
Defense Sciences Office, DARPA
Publisher
Public Library of Science (PLoS)