MTDOT: A Multilingual Translation-Based Data Augmentation Technique for Offensive Content Identification in Tamil Text Data

Authors

Ganganwar Vaishali, Rajalakshmi Ratnavel

Abstract

Low-cost internet access and the widespread use of online social media have led to an increase in offensive content posted in regional languages. Although a large number of comments are available online, only a small percentage of them are offensive, so offensive and non-offensive comments are unequally distributed. Because of this class imbalance, classifiers may be biased toward the class with the most samples, i.e., the non-offensive class. To address class imbalance, this work proposes a Multilingual Translation-based Data augmentation technique for Offensive content identification in Tamil text data (MTDOT). The proposed MTDOT method is applied to HASOC’21, the Tamil offensive content dataset. To obtain a balanced dataset, each offensive comment is augmented using multi-level back translation with English and Malayalam as intermediate languages. Another balanced dataset is generated using single-level back translation with Malayalam, Kannada, and Telugu as intermediate languages. While both approaches are equally effective, the proposed multi-level back-translation approach produces more diverse data, as indicated by the BLEU score. The MTDOT technique achieved a 65% improvement in F1-score over the widely used SMOTE class-balancing method.
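The back-translation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a hypothetical translate(text, src_lang, tgt_lang) callable supplied by the reader (for example, a wrapper around any machine translation model), and it assumes that "multi-level" means chaining the pivot languages in order (Tamil to English to Malayalam and back to Tamil), which is one plausible reading of the abstract. The function names, the language codes, and the stub translator are all illustrative.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: translate(text, src_lang, tgt_lang) -> translated text.
# A real system would wrap a machine translation model or service here.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, pivots: Sequence[str], translate: Translator,
                   source_lang: str = "ta") -> str:
    """Translate text through each pivot language in order, then back to the source."""
    current, current_lang = text, source_lang
    for pivot in pivots:
        current = translate(current, current_lang, pivot)
        current_lang = pivot
    return translate(current, current_lang, source_lang)


def augment_minority(samples: List[Tuple[str, int]], minority_label: int,
                     pivot_chains: Sequence[Sequence[str]],
                     translate: Translator) -> List[Tuple[str, int]]:
    """Return the samples plus one back-translated copy of each minority
    sample per pivot chain. Majority-class samples are left unchanged."""
    augmented = list(samples)
    for text, label in samples:
        if label != minority_label:
            continue
        for chain in pivot_chains:
            augmented.append((back_translate(text, chain, translate), label))
    return augmented


def tagging_translator(text: str, src: str, tgt: str) -> str:
    """Stand-in translator that only records the route taken (no real translation)."""
    return f"{text}|{src}>{tgt}"


# Assumed chains: one multi-level chain, or three single-level chains.
MULTI_LEVEL = [["en", "ml"]]
SINGLE_LEVEL = [["ml"], ["kn"], ["te"]]

data = [("comment A", 1), ("comment B", 0), ("comment C", 0)]
multi = augment_minority(data, 1, MULTI_LEVEL, tagging_translator)
single = augment_minority(data, 1, SINGLE_LEVEL, tagging_translator)
```

With the stub translator, the multi-level chain adds one copy of the offensive comment and the single-level chains add three. How many copies are needed to balance a real dataset depends on the actual class ratio, which the abstract does not state.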

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering


