A Transformer Based Approach for Abuse Detection in Code Mixed Indic Languages.

Author:

Bansal Vibhuti, Tyagi Mrinal (1), Sharma Rajesh (2), Gupta Vedika (3), Xin Qin (4)

Affiliation:

1. Bharati Vidyapeeth’s College of Engineering, India

2. Institute of Computer Science, University of Tartu, Estonia

3. Jindal Global Business School, O.P. Jindal Global University, India

4. Faculty of Science and Technology, University of the Faroe Islands, Faroe Islands

Abstract

The growing number of online social media platforms has drawn active participation from web users worldwide. This has also led to a rise in cyberbullying cases online. Such incidents damage an individual’s reputation or defame a community, and they threaten the privacy of users in cyberspace. Traditionally, manual checks and handling mechanisms have been used to deal with such textual content. However, an automatic computer-based approach could provide a far better solution to this problem. Existing approaches to automating this task mostly involve classical machine learning models, which tend to perform poorly on low-resource languages. Owing to the varied backgrounds and languages of web users, cyberspace contains a great deal of multilingual text. An integrated approach that accommodates multilingual text could be an appropriate solution. This paper explores various methods to detect abusive content in 13 Indic code-mixed languages. Firstly, baseline classical machine learning models are compared with Transformer-based architectures. Secondly, the paper presents an experimental analysis of four state-of-the-art transformer-based models, namely XLM-RoBERTa, indic-BERT, MurilBert and mBERT, of which XLM-RoBERTa with BiGRU performs best. Thirdly, the best-performing model, XLM-RoBERTa, is fed with emoji embeddings, which further improves its overall performance. Finally, the model is trained on the combined dataset of 13 Indic languages to compare its performance with that of the individual language models. The combined model surpassed the individual models in terms of F1 score and accuracy, suggesting that the combined model fits the data better, possibly due to its code-mixed nature. This model reports an F1 score of 0.88 on test data, with a training loss of 0.28, a validation loss of 0.31 and an AUC score of 0.94 for both training and validation.
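
The abstract compares models by F1 score and accuracy. As a minimal sketch of how these two metrics are computed for a binary abusive / not-abusive task, the Python snippet below uses a hypothetical helper `binary_f1_and_accuracy` and toy labels invented for illustration. The abstract does not state which F1 averaging scheme the authors used, so the single positive-class F1 here is an assumption, not the paper's evaluation code.

```python
def binary_f1_and_accuracy(y_true, y_pred, positive=1):
    """Return (f1, accuracy) for binary labels; `positive` marks the abusive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return f1, accuracy


# Toy labels, invented for illustration only (1 = abusive, 0 = not abusive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

f1, acc = binary_f1_and_accuracy(y_true, y_pred)
print(f"F1={f1:.2f} accuracy={acc:.2f}")  # F1=0.75 accuracy=0.75
```

F1 is the harmonic mean of precision and recall, so it penalizes a model that ignores the minority (abusive) class, whereas accuracy alone can look high on imbalanced data.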

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science
