Affiliation:
1. Changchun University of Technology, Changchun, China
2. School of Computer Science and Engineering, Changchun University of Technology, Changchun, China
Abstract
Token-level data augmentation generates new text samples by modifying the words of a sentence. However, samples that are not easily classified can negatively affect the model. In particular, applying random augmentation operations without considering the role of keywords may produce low-quality augmented samples. We therefore propose a supervised contrastive learning model for text classification based on data quality augmentation. First, dynamic training is used to select high-quality data that contain information beneficial to model training. The selected data are then augmented based on important words with tag information. To obtain a better text representation for the downstream classification task, we train the model with a standard supervised contrastive loss. Finally, we conduct experiments on five text classification datasets to validate the effectiveness of the model, and ablation experiments to verify the impact of each module on classification.
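The abstract does not spell out the loss, so the following is only a minimal sketch of one commonly used form of a supervised contrastive loss, not the authors' implementation. The function name `supervised_contrastive_loss`, the `temperature` value, and the toy embeddings are all assumptions made for illustration. Each anchor is pulled toward samples that share its label and pushed away from all other samples in the batch.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Illustrative supervised contrastive loss over one batch.

    embeddings: list of equal-length float vectors (hypothetical text encodings).
    labels: list of class labels, one per embedding.
    Returns the mean loss over anchors that have at least one positive.
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives: other samples in the batch with the same label.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp over all non-anchor samples, computed stably.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Average negative log-probability of the positives.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch (assumed data): two tight clusters in 2-D.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

On this toy batch the loss is lower when the labels agree with the clusters than when they cut across them, which is the behaviour the loss is meant to reward.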
Funder
Science and Technology Bureau of Changchun City
Jilin Province Development and Reform Commission
Publisher
Association for Computing Machinery (ACM)