MTDOT: A Multilingual Translation-Based Data Augmentation Technique for Offensive Content Identification in Tamil Text Data

Published: 2022-11-01 Volume: 11 Issue: 21 Page: 3574
ISSN: 2079-9292
Container-title: Electronics
Language: en

Author:

Ganganwar Vaishali, Rajalakshmi Ratnavel

Abstract

The posting of offensive content in regional languages has increased as a result of the accessibility of low-cost internet and the widespread use of online social media. Despite the large number of comments available online, only a small percentage of them are offensive, resulting in an unequal distribution of offensive and non-offensive comments. Due to this class imbalance, classifiers may be biased toward the class with the most samples, i.e., the non-offensive class. To address class imbalance, a Multilingual Translation-based Data augmentation technique for Offensive content identification in Tamil text data (MTDOT) is proposed in this work. The proposed MTDOT method is applied to HASOC’21, the Tamil offensive content dataset. To obtain a balanced dataset, each offensive comment is augmented using multi-level back translation with English and Malayalam as intermediate languages. Another balanced dataset is generated by employing single-level back translation with Malayalam, Kannada, and Telugu as intermediate languages. While both approaches are equally effective, the proposed multi-level back-translation approach produces more diverse data, as shown by the BLEU score. The MTDOT technique improves the F1-score by 65% over the widely used SMOTE class-balancing method.
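The back-translation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `translate` callable, the function names `back_translate` and `augment_minority`, the language codes, and the order of the pivot languages are all assumptions, and any real machine translation service would have to be plugged in by the reader.

```python
from typing import Callable, List, Sequence

# Hypothetical translator signature (an assumption, not a real library API):
# (text, source_language, target_language) -> translated text.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, pivots: Sequence[str], translate: Translator,
                   source_lang: str = "ta") -> str:
    """Translate through each pivot language in order, then back to the source.

    One pivot gives single-level back translation; two or more pivots give
    multi-level back translation.
    """
    current, current_lang = text, source_lang
    for pivot in pivots:
        current = translate(current, current_lang, pivot)
        current_lang = pivot
    return translate(current, current_lang, source_lang)


def augment_minority(comments: Sequence[str],
                     pivot_chains: Sequence[Sequence[str]],
                     translate: Translator,
                     source_lang: str = "ta") -> List[str]:
    """Return the original comments plus one variant per pivot chain.

    Variants identical to a comment already collected are skipped, since
    they add no new training signal.
    """
    augmented = list(comments)
    seen = set(comments)
    for comment in comments:
        for chain in pivot_chains:
            variant = back_translate(comment, chain, translate, source_lang)
            if variant not in seen:
                seen.add(variant)
                augmented.append(variant)
    return augmented


# Assumed pivot chains mirroring the two setups named in the abstract.
MULTI_LEVEL = [["en", "ml"]]            # Tamil -> English -> Malayalam -> Tamil
SINGLE_LEVEL = [["ml"], ["kn"], ["te"]]  # one pivot language per round trip
```

Calling `augment_minority(offensive_comments, SINGLE_LEVEL, translate)` would yield up to three extra variants per offensive comment, which is one way the minority class could be grown toward the size of the non-offensive class.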

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering

Link

https://www.mdpi.com/2079-9292/11/21/3574/pdf

References: 24 articles.

1. Rajalakshmi, R., and Reddy, B.Y. DLRG@HASOC 2019: An Enhanced Ensemble Classifier for Hate and Offensive Content Identification. Proceedings of the Working Notes of FIRE 2019—Forum for Information Retrieval Evaluation, Volume 2517.

2. B, Y.R., and Rajalakshmi, R. DLRG@HASOC 2020: A Hybrid Approach for Hate and Offensive Content Identification in Multilingual Tweets. Proceedings of the Working Notes of FIRE 2020—Forum for Information Retrieval Evaluation, CEUR Workshop Proceedings, Volume 2826.

3. Rajalakshmi, R., Reddy, P., Khare, S., and Ganganwar, V. Sentimental Analysis of Code-Mixed Hindi Language. Proceedings of the Congress on Intelligent Systems, 2022.

4. Chakravarthi, B.R., Kumaresan, P.K., Sakuntharaj, R., Madasamy, A.K., Thavareesan, S., B, P., Chinnaudayar Navaneethakrishnan, S., McCrae, J.P., and Mandl, T. Overview of the HASOC-DravidianCodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam. Proceedings of the Working Notes of FIRE 2021—Forum for Information Retrieval Evaluation, CEUR.

5. Corbeil, J.P., and Ghadivel, H.A. BET: A backtranslation approach for easy data augmentation in transformer-based paraphrase identification context. arXiv, 2020.

Cited by 6 articles.

1. Using Website Content for Detecting Phishing URLs: A Novel Approach; Lecture Notes in Networks and Systems; 2024

2. Abusive comment detection in Tamil using deep learning; Computational Intelligence Methods for Sentiment Analysis in Natural Language Processing Applications; 2024

3. Enhanced Hindi Aspect-based Sentiment Analysis using Class Balancing Approach; International Journal of Information Technology; 2023-09-05

4. Tamil NLP Technologies: Challenges, State of the Art, Trends and Future Scope; Communications in Computer and Information Science; 2023

5. Context Sensitive Tamil Language Spellchecker Using RoBERTa; Communications in Computer and Information Science; 2023