Data Augmentation For Sorani Kurdish News Headline Classification Using Back-Translation And Deep Learning Model

Published: 2023-06-30 Page: 27-34
ISSN:2411-7706
Container-title:Kurdistan Journal of Applied Research
Short-container-title:KJAR

Author:

Badawi Soran

Abstract

As the volume of news articles and headlines grows, it is becoming more difficult for individuals to keep up with the latest developments and find relevant news articles in the Kurdish language. To address this issue, this paper proposes a novel data augmentation approach for improving the performance of Kurdish news headline classification using back-translation and a proposed deep learning Bidirectional Long Short-Term Memory (BiLSTM) model. The approach generates synthetic training data by translating Kurdish headlines into a target language (here, English) and back-translating them into Kurdish, resulting in an augmented dataset. The proposed BiLSTM model is trained on the augmented data and compared with two baseline models, Support Vector Machines (SVM) and Naïve Bayes, trained on the original data. The experimental results demonstrate that the proposed BiLSTM model outperforms the baseline models and other existing models, achieving state-of-the-art performance on the Kurdish news headline classification task. The findings suggest that the combination of back-translation and the proposed BiLSTM model is a promising approach for data augmentation in low-resource languages, contributing to the advancement of natural language processing in under-resourced languages. Moreover, a Kurdish news headline classification model can improve access to news and information for Kurdish speakers: with it, they can quickly search for news articles that interest them based on their preferred categories, such as politics, sports, or entertainment.
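
The back-translation step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `translate` callable, the `toy_translate` stub, the placeholder headlines, and the language codes passed around are all hypothetical stand-ins for whatever machine translation system is actually used.

```python
from typing import Callable, List, Tuple

# Assumed signature for a translation backend:
# (text, source_lang, target_lang) -> translated text
Translator = Callable[[str, str, str], str]
Example = Tuple[str, str]  # (headline, category label)


def back_translate(text: str, translate: Translator,
                   src: str = "ckb", pivot: str = "en") -> str:
    """Translate src -> pivot -> src and return the round-tripped text."""
    pivot_text = translate(text, src, pivot)
    return translate(pivot_text, pivot, src)


def augment(dataset: List[Example], translate: Translator,
            src: str = "ckb", pivot: str = "en") -> List[Example]:
    """Return the original examples plus any new back-translated variants.

    Each variant keeps the label of the headline it was derived from.
    Round trips that reproduce an already-seen headline are skipped.
    """
    augmented = list(dataset)
    seen = {text for text, _ in dataset}
    for text, label in dataset:
        variant = back_translate(text, translate, src, pivot)
        if variant not in seen:
            seen.add(variant)
            augmented.append((variant, label))
    return augmented


def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    """Hypothetical lookup-table 'translator' used only to make the sketch runnable."""
    table = {
        ("ckb", "en"): {"headline a": "pivot a", "headline b": "pivot b"},
        ("en", "ckb"): {"pivot a": "headline a (reworded)", "pivot b": "headline b"},
    }
    return table[(source_lang, target_lang)].get(text, text)


data = [("headline a", "sports"), ("headline b", "politics")]
augmented_data = augment(data, toy_translate)
```

In this toy run, the first headline comes back reworded and is added with its original label, while the second round-trips unchanged and is skipped. A classifier such as the BiLSTM would then be trained on `augmented_data` rather than `data`.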

Publisher

Sulaimani Polytechnic University

Subject

General Economics, Econometrics and Finance

Reference: 28 articles.

1. B. R. Chakravarthi et al., "Detecting abusive comments at a fine-grained level in a low-resource language," Natural Language Processing Journal, vol. 3, p. 100006, Jun. 2023, doi: 10.1016/j.nlp.2023.100006.

2. M. A. Hedderich, L. Lange, H. Adel, J. Strötgen, and D. Klakow, "A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios," Oct. 2020.

3. C. Shorten, T. M. Khoshgoftaar, and B. Furht, "Text Data Augmentation for Deep Learning," J Big Data, vol. 8, no. 1, p. 101, Dec. 2021, doi: 10.1186/s40537-021-00492-0.

4. M. Varasteh and A. Kazemi, "Using ParsBert on Augmented Data for Persian News Classification," in 2021 7th International Conference on Web Research (ICWR), IEEE, May 2021, pp. 78-81. doi: 10.1109/ICWR51868.2021.9443119.

5. J. P. R. Sharami, P. A. Sarabestani, and S. A. Mirroshandel, "DeepSentiPers: Novel Deep Learning Models Trained Over Proposed Augmented Persian Sentiment Corpus," Apr. 2020, [Online]. Available: http://arxiv.org/abs/2004.05328

Cited by 2 articles.

1. Bridging the Gap; ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY; 2024-04-03

2. KurdiSent: a corpus for Kurdish sentiment analysis; Language Resources and Evaluation; 2024-01-02