A Comparative Survey of Instance Selection Methods applied to Non-Neural and Transformer-Based Text Classification

Authors:

Washington Cunha¹, Felipe Viegas¹, Celso França¹, Thierson Rosa², Leonardo Rocha³, Marcos André Gonçalves¹

Affiliation:

1. Federal University of Minas Gerais

2. Federal University of Goiás

3. Federal University of São João Del-Rei

Abstract

Progress in natural language processing has been dictated by the rule of more: more data, more computing power, more complexity, best exemplified by deep learning Transformers. However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. One way to ameliorate this problem is through data engineering rather than algorithmic or hardware improvements. Our focus here is an under-investigated data engineering technique with enormous potential in the current scenario: Instance Selection (IS), also known as Selective Sampling or Prototype Selection. The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining or improving the effectiveness (accuracy) of the trained models and reducing the cost of the training process. We survey classical and recent state-of-the-art IS techniques and provide a scientifically sound comparison of IS methods applied to an essential natural language processing task—Automatic Text Classification (ATC). IS methods have normally been applied to small tabular datasets and have not been systematically compared in ATC. We consider several neural and non-neural state-of-the-art ATC solutions and many datasets. We answer several research questions based on the tradeoffs induced by a tripod of criteria: training set reduction, effectiveness, and efficiency. Our answers reveal an enormous unfulfilled potential for IS solutions. Specifically, we show that in 12 out of 19 datasets, specific IS methods—namely, Condensed Nearest Neighbor, Local Set-based Smoother, and Local Set Border Selector—can reduce the size of the training set without effectiveness losses. Furthermore, when fine-tuning Transformer models, these IS methods reduce the amount of data needed without losing effectiveness and with considerable training-time gains.
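The Condensed Nearest Neighbor (CNN) method highlighted in the abstract can be sketched briefly. The following is an illustrative implementation of Hart's classical CNN procedure—seed a "store" with one instance per class, then repeatedly add any instance that the current store misclassifies under 1-NN—not the authors' experimental code; the use of dense feature vectors and the Euclidean metric are assumptions for this example.

```python
import numpy as np

def condensed_nearest_neighbor(X, y):
    """Return sorted indices of a reduced 'store' S such that every
    training instance is correctly classified by its 1-NN in S."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # Seed the store with the first occurrence of each class.
    store = [int(np.argmax(y == c)) for c in np.unique(y)]
    changed = True
    while changed:          # repeat until a full pass adds nothing
        changed = False
        for i in range(len(X)):
            if i in store:
                continue
            # 1-NN prediction using only the current store.
            dists = np.linalg.norm(X[store] - X[i], axis=1)
            pred = y[store[int(np.argmin(dists))]]
            if pred != y[i]:    # misclassified -> keep this instance
                store.append(i)
                changed = True
    return sorted(store)

# Toy example: two well-separated classes; redundant points are dropped.
X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
y = [0, 0, 0, 1, 1, 1]
kept = condensed_nearest_neighbor(X, y)
print(kept)  # -> [0, 3]: the two seeds suffice to classify everything
```

In an ATC setting, `X` would hold document representations (e.g., TF-IDF or embedding vectors), and the reduced index set `kept` would be used to train the downstream classifier on a fraction of the data.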

Funder

CNPq

CAPES

FAPEMIG

Amazon Web Services

NVIDIA

Google Research Awards

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science, Theoretical Computer Science

References: 83 articles.


Cited by 2 articles.
