A Comparative Survey of Instance Selection Methods applied to Non-Neural and Transformer-Based Text Classification

Authors:

Washington Cunha¹, Felipe Viegas¹, Celso França¹, Thierson Rosa², Leonardo Rocha³, Marcos André Gonçalves¹

Affiliation:

1. Federal University of Minas Gerais

2. Federal University of Goiás

3. Federal University of São João Del-Rei

Abstract

Progress in natural language processing has been dictated by the rule of more: more data, more computing power, more complexity, best exemplified by deep learning Transformers. However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. One way to ameliorate this problem is through data engineering rather than algorithmic or hardware improvements. Our focus here is an under-investigated data engineering technique with enormous potential in the current scenario: Instance Selection (IS), also known as Selective Sampling or Prototype Selection. The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining or improving the effectiveness (accuracy) of the trained models and reducing the cost of the training process. We survey classical and recent state-of-the-art IS techniques and provide a scientifically sound comparison of IS methods applied to an essential natural language processing task—Automatic Text Classification (ATC). IS methods have normally been applied to small tabular datasets and have not been systematically compared in ATC. We consider several neural and non-neural state-of-the-art ATC solutions and many datasets. We answer several research questions based on the tradeoffs induced by a tripod of criteria: training set reduction, effectiveness, and efficiency. Our answers reveal an enormous unfulfilled potential for IS solutions. Specifically, we show that in 12 out of 19 datasets, specific IS methods—namely, Condensed Nearest Neighbor, Local Set-based Smoother, and Local Set Border Selector—can reduce the size of the training set without effectiveness losses. Furthermore, when fine-tuning Transformer models, these IS methods reduce the amount of data needed without losing effectiveness and with considerable training-time gains.
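The Condensed Nearest Neighbor (CNN) method highlighted in the abstract can be sketched briefly. The following is an illustrative implementation of Hart's classical CNN procedure—seed a "store" with one instance per class, then repeatedly add any instance that the current store misclassifies under 1-NN—not the authors' experimental code; the use of dense feature vectors and the Euclidean metric are assumptions for this example.

```python
import numpy as np

def condensed_nearest_neighbor(X, y):
    """Return sorted indices of a reduced 'store' S such that every
    training instance is correctly classified by its 1-NN in S."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # Seed the store with the first occurrence of each class.
    store = [int(np.argmax(y == c)) for c in np.unique(y)]
    changed = True
    while changed:          # repeat until a full pass adds nothing
        changed = False
        for i in range(len(X)):
            if i in store:
                continue
            # 1-NN prediction using only the current store.
            dists = np.linalg.norm(X[store] - X[i], axis=1)
            pred = y[store[int(np.argmin(dists))]]
            if pred != y[i]:    # misclassified -> keep this instance
                store.append(i)
                changed = True
    return sorted(store)

# Toy example: two well-separated classes; redundant points are dropped.
X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
y = [0, 0, 0, 1, 1, 1]
kept = condensed_nearest_neighbor(X, y)
print(kept)  # -> [0, 3]: the two seeds suffice to classify everything
```

In an ATC setting, `X` would hold document representations (e.g., TF-IDF or embedding vectors), and the reduced index set `kept` would be used to train the downstream classifier on a fraction of the data.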

Funder

CNPq

CAPES

FAPEMIG

Amazon Web Services

NVIDIA

Google Research Awards

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science, Theoretical Computer Science

References: 83 articles.


Cited by 2 articles.
