Meaning-Sensitive Text Data Augmentation with Intelligent Masking

Author:

Kasthuriarachchy Buddhika (1), Chetty Madhu (1), Shatte Adrian (1), Walls Darren (2)

Affiliation:

1. Federation University Australia, Australia

2. Global Hosts Pty Ltd, Australia

Abstract

Large-scale deep neural network-based models are now widely applied to natural language processing (NLP), and because limited training data tends to significantly reduce the accuracy of these models, interest in text data augmentation methods is at its peak. To this end, we propose a novel text data augmentation technique called Intelligent Masking with Optimal Substitutions Text Data Augmentation (IMOSA). IMOSA, developed for labelled sentences, identifies the most favourable sentences and locates the appropriate word combinations to replace in each of them, generating synthetic sentences whose meaning stays close to that of the original sentence while also significantly increasing the diversity of the dataset. Through extensive experiments on five different text classification tasks, performed under a low-data regime in a context-aware NLP setting, we demonstrate that the proposed technique notably improves the performance of classifiers based on attention-based transformer models. The analysis shows that IMOSA generates more sentences from favourable original examples and completely ignores undesirable examples. Furthermore, the experiments confirm IMOSA’s ability to add variety to the training dataset by applying multiple distinct masking patterns to the same original sentence. IMOSA consistently outperforms the two key masked language model-based text data augmentation techniques and demonstrates robust performance on critical, challenging NLP tasks.
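The general idea of masking word combinations and substituting them to obtain several synthetic variants of one labelled sentence can be illustrated with a toy sketch. This is not the IMOSA algorithm, whose sentence selection and substitution criteria are not specified in this record. The substitution table `TOY_SUBSTITUTES` is a hypothetical stand-in for a masked language model, and every function name below is an assumption made for illustration only.

```python
from itertools import combinations, product

MASK = "[MASK]"

# Hypothetical stand-in for a masked language model: maps a word to
# plausible replacements. A real system would query a trained model instead.
TOY_SUBSTITUTES = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "really": ["truly"],
}


def masking_patterns(tokens, max_masks=2):
    """Yield tuples of positions to mask, restricted to words that
    have at least one entry in the toy substitution table."""
    maskable = [i for i, tok in enumerate(tokens) if tok in TOY_SUBSTITUTES]
    for k in range(1, max_masks + 1):
        yield from combinations(maskable, k)


def apply_mask(tokens, positions):
    """Return a copy of tokens with the given positions replaced by MASK."""
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def fill(tokens, positions):
    """Yield every sentence obtained by substituting the masked positions."""
    choices = [TOY_SUBSTITUTES[tokens[i]] for i in positions]
    for combo in product(*choices):
        out = list(tokens)
        for i, word in zip(positions, combo):
            out[i] = word
        yield " ".join(out)


def augment(sentence, label, max_masks=2):
    """Return distinct (synthetic_sentence, label) pairs for one labelled
    sentence, covering every masking pattern up to max_masks positions."""
    tokens = sentence.split()
    seen, results = set(), []
    for positions in masking_patterns(tokens, max_masks):
        for candidate in fill(tokens, positions):
            if candidate not in seen:
                seen.add(candidate)
                results.append((candidate, label))
    return results


if __name__ == "__main__":
    for text, lab in augment("a really good movie", "pos"):
        print(lab, "|", text)
```

Each masking pattern yields its own set of variants, so one original sentence contributes several distinct synthetic sentences that all keep the original label. A sentence with no substitutable words produces no output, which loosely mirrors the idea of skipping unsuitable originals.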

Funder

Global Hosts Pty Ltd trading as SportsHosts

Publisher

Association for Computing Machinery (ACM)

Subject

Artificial Intelligence, Theoretical Computer Science

