Abusive and Hate speech Classification in Arabic Text Using Pre-trained Language Models and Data Augmentation

Published: 2024-08-03
ISSN: 2375-4699
Container-title: ACM Transactions on Asian and Low-Resource Language Information Processing
Language: en
Short-container-title: ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Badri Nabil¹, Kboubi Ferihane², Habacha Chaibi Anja²

Affiliation:

1. Computer Science, University of Manouba-National School of Computer Science-RIADI Laboratory, Manouba, Tunisia

2. Computer Science, University of Manouba-National School of Computer Science-RIADI Laboratory, Manouba, Tunisia

Abstract

Hateful content on social media is a worldwide problem that adversely affects not just the targeted individuals but also anyone to whom the content is accessible. Most studies on the automatic identification of inappropriate content have addressed English, given the availability of resources. As a result, a number of low-resource languages still need more attention from the community. This paper focuses on dialectal Arabic, which has several specificities that make the use of non-Arabic models inappropriate. Our hypothesis is that leveraging pre-trained language models (PLMs) specifically designed for Arabic, along with data augmentation techniques, can significantly enhance the detection of hate speech in Arabic mono/multi-dialect texts. To test this hypothesis, we conducted a series of experiments addressing three key research questions: (RQ1) Does text augmentation improve the final results compared to using an unaugmented dataset? (RQ2) Do Arabic PLMs outperform models built on word embeddings such as fastText and AraVec? (RQ3) Does training and fine-tuning models on a multilingual dataset yield better results than training them on a monolingual dataset? Our methodology compares PLMs based on transfer learning, specifically the DziriBERT, AraBERT v2, and Bert-base-arabic models. We implemented text augmentation techniques and evaluated their impact on model performance. The tools used included fastText and AraVec for word embeddings, as well as various PLMs for transfer learning. The results demonstrate a notable improvement in classification performance, with augmented datasets improving the evaluation metrics (accuracy, precision, recall, and F1-score) by up to 15-21% compared to non-augmented datasets. This underscores the potential of data augmentation to enhance the models’ ability to generalize across the nuanced spectrum of Arabic dialects.

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3679049

Reference: 75 articles.

1. Hate speech review in the context of online social networks

2. Amine Abdaoui, Mohamed Berrimi, Mourad Oussalah, and Abdelouahab Moussaoui. 2021. DziriBERT: a Pre-trained Language Model for the Algerian Dialect. arXiv preprint arXiv:2109.12346 (2021).

3. Kareem E Abdelfatah, Gabriel Terejanu, Ayman A Alhelbawy, et al. 2017. Unsupervised detection of violent content in arabic social media. Computer Science & Information Technology (CS & IT) 7 (2017).

4. Muhammad Abdul-Mageed, AbdelRahim Elmadany, and El Moatez Billah Nagoudi. 2020. ARBERT & MARBERT: deep bidirectional transformers for Arabic. arXiv preprint arXiv:2101.01785 (2020).

5. Arabic sentiment analysis: Lexicon-based and corpus-based