Affiliation:
1. Department of Computer Engineering, Adana Alparslan Turkes Science and Technology University, Adana, Turkey
2. Department of Computer Engineering, Kahramanmaraş Sütçü İmam University, Kahramanmaraş, Turkey
Abstract
Social media platforms host vast amounts of user-generated content, from which information about users’ thoughts can be extracted. Individuals express themselves freely on these platforms, often without constraints, even when the content is offensive or contains hate speech. Identifying and removing offensive content from social media is imperative to prevent individuals or groups from becoming targets of harmful language. Despite extensive research on offensive content detection, the problem remains unsolved for code-mixed languages, where it is compounded by imbalanced datasets and limited data sources. Most previous studies on detecting offensive content in these languages focus on creating datasets and applying deep neural networks, such as Recurrent Neural Networks (RNNs), or pre-trained language models (PLMs), such as BERT and its variants. Given the low-resource nature of these languages and their imbalanced datasets, this study investigates the efficacy of a syntax-aware BERT model with continual pre-training for identifying offensive content and proposes a framework called Cont-Syntax-BERT, which combines continual learning with continual pre-training. Comprehensive experimental results demonstrate that the proposed Cont-Syntax-BERT framework outperforms state-of-the-art approaches. Its performance on the DravidianCodeMix [10,19] and HASOC 2019 [37] datasets shows that the framework adapts effectively to the challenges posed by code-mixed languages.
Publisher
Association for Computing Machinery (ACM)
References: 63 articles.
1. Publicly Available Clinical
2. Tamil Offensive Language Detection: Supervised versus Unsupervised Learning Approaches
3. Somnath Banerjee, Maulindu Sarkar, Nancy Agrawal, Punyajoy Saha, and Mithun Das. 2021. Exploring Transformer Based Models to Identify Hate Speech and Offensive Content in English and Indo-Aryan Languages. In Forum for Information Retrieval Evaluation (Working Notes) (FIRE). CEUR-WS.org.
4. Md Abul Bashar and Richi Nayak. 2020. QutNocturnal@HASOC’19: CNN for hate speech and offensive content identification in Hindi language. arXiv preprint arXiv:2008.12448 (2020).
5. Jasmijn Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an. 2017. Graph Convolutional Encoders for Syntax-aware Neural Machine Translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 1957–1967.