
Hate speech detection in low-resourced Indian languages: An analysis of transformer-based monolingual and multilingual models with cross-lingual experiments

Published:2024-08-27 Page:1-22
ISSN:2977-0424
Container-title:Natural Language Processing
language:en
Short-container-title:Nat. lang. processing

Author:

Ghosh Koyel, Senapati Apurbalal

Abstract

Warning: This paper is based on hate speech detection and may contain examples of abusive/offensive phrases. Cyberbullying, online harassment, etc., via offensive comments are pervasive across social media platforms such as Twitter, Facebook, and YouTube. Hateful comments must be detected and removed to prevent harassment and violence on social media. In the Natural Language Processing (NLP) domain, comment classification is the most prevalent task and a challenging one, and transformer-based language models are at the forefront of progress on it. This paper analyzes the performance of transformer-based language models such as BERT, ALBERT, RoBERTa, and DistilBERT on Indian hate speech datasets for binary classification. We use the existing datasets HASOC (Hindi and Marathi) and HS-Bangla. We evaluate several multilingual language models, such as MuRIL-BERT and XLM-RoBERTa, and a few monolingual language models, such as RoBERTa-Hindi, Maha-BERT (Marathi), Bangla-BERT (Bangla), and Assamese-BERT (Assamese), and we also perform cross-lingual experiments. For further analysis, we also perform multilingual, monolingual, and cross-lingual experiments on our Hate Speech Assamese (HS-Assamese; Indo-Aryan language family) and Hate Speech Bodo (HS-Bodo; Sino-Tibetan language family) datasets (HS dataset version 2) and achieve promising results. The motivation for the cross-lingual experiments is to encourage researchers to learn about the power of transformers. Note that no pre-trained language models are currently available for Bodo or any other Sino-Tibetan language.

Publisher

Cambridge University Press (CUP)
