Abusive and Hate Speech Classification in Arabic Text Using Pre-trained Language Models and Data Augmentation

Authors:

Badri Nabil¹, Kboubi Ferihane², Habacha Chaibi Anja²

Affiliations:

1. Computer Science, University of Manouba, National School of Computer Science, RIADI Laboratory, Manouba, Tunisia

2. Computer Science, University of Manouba, National School of Computer Science, RIADI Laboratory, Manouba, Tunisia

Abstract

Hateful content on social media is a worldwide problem that adversely affects not only the targeted individuals but also anyone exposed to it. Most studies on the automatic identification of inappropriate content have addressed English, given the availability of resources for that language, so a number of low-resource languages still need more attention from the community. This paper focuses on Arabic and its dialects, whose specificities make the use of non-Arabic models inappropriate. Our hypothesis is that leveraging pre-trained language models (PLMs) specifically designed for Arabic, along with data augmentation techniques, can significantly enhance the detection of hate speech in Arabic mono- and multi-dialect texts. To test this hypothesis, we conducted a series of experiments addressing three key research questions: (RQ1) Does text augmentation improve the final results compared to using an unaugmented dataset? (RQ2) Do Arabic PLMs outperform models that rely on techniques such as fastText and AraVec word embeddings? (RQ3) Does training and fine-tuning models on a multilingual dataset yield better results than training them on a monolingual dataset? Our methodology compares PLMs based on transfer learning, specifically examining the performance of the DziriBERT, AraBERT v2, and bert-base-arabic models. We implemented text augmentation techniques and evaluated their impact on model performance; the tools used include fastText and AraVec for word embeddings, as well as various PLMs for transfer learning. The results show a notable improvement in classification accuracy, with augmented datasets increasing the performance metrics (accuracy, precision, recall, and F1-score) by 15% to 21% compared to non-augmented datasets. This underscores the potential of data augmentation in enhancing the models' ability to generalize across the nuanced spectrum of Arabic dialects.
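
To make the described pipeline more concrete, the following is a minimal sketch of the kind of workflow the abstract outlines: a simple text-augmentation step followed by loading an Arabic PLM for sequence classification with the Hugging Face Transformers library. The checkpoint ID, number of labels, sample sentence, and augmentation recipe below are illustrative assumptions, not the paper's actual configuration.

    # Illustrative sketch only: augment Arabic text with a naive random
    # deletion/swap, then load an Arabic PLM for abusive/hate-speech
    # classification. Checkpoint name, label count, and sample text are
    # assumptions, not the authors' exact setup.
    import random
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    def augment(text: str, p_delete: float = 0.1) -> str:
        """Naive augmentation: randomly drop a few tokens, then swap two neighbours."""
        tokens = text.split()
        kept = [t for t in tokens if random.random() > p_delete] or tokens
        if len(kept) > 1:
            i = random.randrange(len(kept) - 1)
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
        return " ".join(kept)

    model_name = "aubmindlab/bert-base-arabertv2"  # assumed AraBERT v2 checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=3)  # e.g. normal / abusive / hate (assumed labels)

    samples = ["مثال على نص عربي"]  # placeholder sentence
    augmented = samples + [augment(s) for s in samples]
    batch = tokenizer(augmented, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch)  # classification head is untrained until fine-tuning
    print(outputs.logits.shape)  # (num_sentences, num_labels)

In practice, the augmented examples would be added to the training split before fine-tuning (for example with the Transformers Trainer API), which is where the accuracy gains reported in the abstract would come from.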

Publisher

Association for Computing Machinery (ACM)

References: 75 articles (first five listed below).

1. Hate speech review in the context of online social networks

2. Amine Abdaoui, Mohamed Berrimi, Mourad Oussalah, and Abdelouahab Moussaoui. 2021. DziriBERT: a Pre-trained Language Model for the Algerian Dialect. arXiv preprint arXiv:2109.12346 (2021).

3. Kareem E Abdelfatah, Gabriel Terejanu, Ayman A Alhelbawy, et al. 2017. Unsupervised detection of violent content in Arabic social media. Computer Science & Information Technology (CS & IT) 7 (2017).

4. Muhammad Abdul-Mageed, AbdelRahim Elmadany, and El Moatez Billah Nagoudi. 2020. ARBERT & MARBERT: deep bidirectional transformers for Arabic. arXiv preprint arXiv:2101.01785 (2020).

5. Arabic sentiment analysis: Lexicon-based and corpus-based
