Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages

Published: 2024-03-29 Volume: 10 Page: e1974
ISSN:2376-5992
Container-title:PeerJ Computer Science
language:en

Author:

Ziyaden Atabay¹², Yelenov Amir²³, Hajiyev Fuad⁴, Rustamov Samir⁴, Pak Alexandr¹²

Affiliation:

1. Kazakh-British Technical University, Almaty, Kazakhstan

2. Institute of Information and Computational Technologies, Almaty, Kazakhstan

3. Nazarbayev University, Astana, Kazakhstan

4. School of Information Technologies and Engineering, ADA University, Baku, Azerbaijan

Abstract

Background. In natural language processing (NLP), the development and success of advanced language models depend largely on the richness of available linguistic resources. Languages such as Azerbaijani, which is classified as low-resource, often face challenges arising from limited labeled datasets, which hinders effective model training.

Methodology. The primary objective of this study was to enhance the effectiveness and generalization capabilities of news text classification models using text augmentation techniques. We address the low-resource problem through translation-based augmentation, using the Facebook mBart50 model, the Google Translate API, and a combination of mBart50 and Google Translate, thus expanding the capabilities available when working with text.

Results. The experimental outcomes show improved classification performance when models are trained on the augmented dataset compared with models trained on the original data. This investigation underscores the potential of combined data augmentation strategies to strengthen NLP capabilities for underrepresented languages. As a result of our research, we have published our labeled text classification dataset and a pre-trained RoBERTa model for the Azerbaijani language.
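The abstract does not spell out the augmentation pipeline, so the following is only a minimal sketch of how translation-based augmentation of a labeled dataset might be organized. It assumes that each translator is a plain callable returning a variant of the input text (in practice this could wrap mBart50 or the Google Translate API, neither of which is called here), that "combination" means pooling the outputs of both translators, and that the label of a variant is inherited from its source example. The function name `augment_by_translation` and the stub translators are hypothetical.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]           # (text, label)
Translator = Callable[[str], str]  # hypothetical wrapper around a translation system


def augment_by_translation(
    samples: List[Sample], translators: List[Translator]
) -> List[Sample]:
    """Return the original samples plus one variant per translator.

    Assumption: a translated variant keeps the label of its source text.
    Empty or exactly duplicated (text, label) pairs are dropped.
    """
    seen = set()
    augmented: List[Sample] = []
    for text, label in samples:
        # The original text comes first, followed by each translator's variant.
        candidates = [text] + [translate(text) for translate in translators]
        for candidate in candidates:
            cleaned = candidate.strip()
            key = (cleaned, label)
            if cleaned and key not in seen:
                seen.add(key)
                augmented.append(key)
    return augmented


# Stub translators standing in for real systems (illustration only).
def stub_translator_a(text: str) -> str:
    return text + " (variant A)"


def stub_translator_b(text: str) -> str:
    return text + " (variant B)"


original = [("news text one", "sport"), ("news text two", "politics")]
combined = augment_by_translation(original, [stub_translator_a, stub_translator_b])
```

With two translators, each original example contributes up to three training examples; passing a single translator corresponds to the single-system settings, and passing both corresponds to the pooled setting assumed above.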

Funder

Ministry of Education and Sciences of the Republic of Kazakhstan

Publisher

PeerJ

Link

https://peerj.com/articles/cs-1974.pdf

References: 34 articles.

1. Effective use of augmentation degree and language model for synonym-based text augmentation on Indonesian text classification;Purwarianti,2019

2. Toward text data augmentation for sentiment analysis;Abonizio;IEEE Transactions on Artificial Intelligence,2022

3. Azertac news dataset;AzerTac;Zenodo,2023

4. Azertac state agency;Azertac,2023

5. Extractive summarization for explainable sentiment analysis using transformers;Bacco,2021