Automated Testing Linguistic Capabilities of NLP Models

Author:

Lee Jaeseong (1), Chen Simin (1), Mordahl Austin (1), Liu Cong (2), Yang Wei (1), Wei Shiyi (1)

Affiliation:

1. The University of Texas at Dallas, USA

2. University of California, Riverside, USA

Abstract

Natural language processing (NLP) has gained widespread adoption in the development of real-world applications. However, the black-box nature of neural networks in NLP applications poses a challenge when evaluating their performance, let alone ensuring it. Recent research has proposed testing techniques to enhance the trustworthiness of NLP-based applications. However, most existing works use a single, aggregated metric (i.e., accuracy), which makes it difficult for users to assess NLP model performance on fine-grained aspects such as linguistic capabilities. To address this limitation, we present ALiCT, an automated testing technique for validating NLP applications based on their linguistic capabilities. ALiCT takes user-specified linguistic capabilities as inputs and produces a diverse test suite with test oracles for each given linguistic capability. We evaluate ALiCT on two widely adopted NLP tasks, sentiment analysis and hate speech detection, in terms of diversity, effectiveness, and consistency. Measured with Self-BLEU and syntactic diversity metrics, the test cases generated by ALiCT are 190% and 2213% more diverse in semantics and syntax, respectively, than those produced by state-of-the-art techniques. In addition, ALiCT reveals a larger number of NLP model failures in 22 out of 25 linguistic capabilities across the two NLP applications.
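For readers unfamiliar with Self-BLEU, the following is a minimal sketch of the idea: each sentence in a test suite is scored with BLEU against all the other sentences, and the scores are averaged, so a lower value suggests a more diverse suite. This is not the paper's evaluation code. The whitespace tokenization, the bigram limit, the add-one smoothing, and the function names (`ngrams`, `bleu`, `self_bleu`) are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=2):
    """Simplified sentence-level BLEU with add-one smoothing (illustrative only)."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps the logarithm defined when nothing matches.
        log_precision += math.log((clipped + 1) / (total + 1)) / max_n
    # Brevity penalty against the reference length closest to the hypothesis.
    closest = min((len(ref) for ref in references),
                  key=lambda r: (abs(r - len(hypothesis)), r))
    if len(hypothesis) >= closest:
        penalty = 1.0
    else:
        penalty = math.exp(1 - closest / max(len(hypothesis), 1))
    return penalty * math.exp(log_precision)


def self_bleu(sentences, max_n=2):
    """Average BLEU of each sentence scored against all the others."""
    if len(sentences) < 2:
        raise ValueError("Self-BLEU needs at least two sentences")
    tokenized = [s.lower().split() for s in sentences]
    scores = [
        bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)


# Hypothetical test suites: one repetitive, one varied.
repetitive = ["the movie was great"] * 3
varied = [
    "the movie was great",
    "i hated every minute of it",
    "what a surprisingly tender film",
]
print(self_bleu(repetitive), self_bleu(varied))
```

Under these assumptions the repetitive suite scores higher than the varied one, which is the sense in which a lower Self-BLEU is read as greater diversity.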

Publisher

Association for Computing Machinery (ACM)
