An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

Published: 2023, Volume: 11, Pages: 191-211
ISSN: 2307-387X
Container-title: Transactions of the Association for Computational Linguistics
Language: en

Author:

Chen Jiaao¹, Tam Derek², Raffel Colin³, Bansal Mohit⁴, Yang Diyi⁵

Affiliation:

1. Georgia Institute of Technology, USA. jchen896@gatech.edu

2. UNC Chapel Hill, USA. dtredsox@cs.unc.edu

3. UNC Chapel Hill, USA. craffel@cs.unc.edu

4. UNC Chapel Hill, USA. mbansal@cs.unc.edu

5. Georgia Institute of Technology, USA. dyang888@gatech.edu

Abstract

NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand which methods work in which settings. In this paper, we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the results, we draw several conclusions to help practitioners choose appropriate augmentations in different settings and discuss the current challenges and future directions for limited data learning in NLP.

Publisher

MIT Press

Subject

Artificial Intelligence, Computer Science Applications, Linguistics and Language, Human-Computer Interaction, Communication

Link

https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00542/2074871/tacl_a_00542.pdf

References: 150 articles.

1. Cross lingual transfer learning for zero-resource domain adaptation;Abad;ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),2020

2. Do not have enough data? Deep learning to the rescue!;Anaby-Tavor,2020

3. Good-enough compositional data augmentation;Andreas,2020

4. Unsupervised neural machine translation;Artetxe,2018

5. Multi-task learning of pairwise sequence classification tasks over disparate label spaces;Augenstein,2018

Cited by 27 articles.

1. Frontiers and developments of data augmentation for image: From unlearnable to learnable;Information Fusion;2025-02

2. A cross-temporal contrastive disentangled model for ancient Chinese understanding;Neural Networks;2024-11

3. On the effectiveness of hybrid pooling in mixup-based graph learning for language processing;Journal of Systems and Software;2024-10

4. A Multi-Scale Target Detection Method Using an Improved Faster Region Convolutional Neural Network Based on Enhanced Backbone and Optimized Mechanisms;Journal of Imaging;2024-08-13

5. Few-shot biomedical relation extraction using data augmentation and domain information;Neurocomputing;2024-08