Affiliation:
1. Jerusalem College of Technology, Department of Computer Science 21 Havaad Haleumi St ., Jerusalem , Israel
2. Shamoon College of Engineering, Department of Software Engineering 56 Bialik st . Beer Sheva , Israel
Abstract
Abstract
This paper introduces a streamlined taxonomy for categorizing offensive language in Hebrew, addressing a gap in the literature that has, until now, largely focused on Indo-European languages. Our taxonomy divides offensive language into seven levels (six explicit and one implicit level). We based our work on the simplified offensive language (SOL) taxonomy introduced in (Lewandowska-Tomaszczyk et al. 2021a) hoping that our adjustment of SOL to the Hebrew language will be capable of reflecting the unique linguistic and cultural nuances of Hebrew. The study involves both linguistic and cultural analysis beyond Natural Language Processing (NLP). We employed manual linguistic analysis to understand the nuances of offensive language in Hebrew.
An accompanying dataset, gathered on Twitter and manually curated by human annotators, is described in detail. This dataset was constructed to both validate the taxonomy and serve as a foundation for future research on offensive language detection and analysis in Hebrew. Preliminary analysis of the dataset reveals intriguing patterns and distributions, underscoring the complexity and specificity of offensive expressions in the Hebrew language.
The aim of our work is to capture the complexity and specificity of offensive expressions in Hebrew beyond what automated NLP methods alone can provide. Our findings highlight the significance of considering linguistic and cultural variations when researching and correcting abusive language online. We believe that our streamlined taxonomy and associated dataset will be crucial in improving research in Hebrew language sociocultural studies, natural language processing, and offensive language detection. Our study also makes a substantial contribution to the study of low-resource languages and can be used as a model for future research on other languages.
Subject
Linguistics and Language,Communication,Language and Linguistics
Reference46 articles.
1. Belkina, Anna C, Christopher O. Ciccolella, Rina Anno, Richard Halpert, Josef Spidlen & Jennifer E. Snyder-Cappione. 2019. Automated optimized parameters for t-distributed stochastic neighbor embedding improve visualization and analysis of large datasets. Nature communications 10(1). 5415.
2. Bojanowski, Piotr, Edouard Grave, Armand Joulin & Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the association for computational linguistics 5. 135–146.
3. Bright, J. 2022. History under attack: Holocaust denial and distortion on social media. Supporting Data. United Nations Educational, Scientific and Cultural Organization (UNESCO), Paris, France, and the United Nations Department of Global Communications, United Nations, New York, USA.
4. Caselli, Tommaso, Valerio Basile, Jelena Mitrovic, Inga Kartoziya & Michael Granitzer. 2020. I feel offended, don’t be abusive! implicit/explicit messages in offensive and abusive language. In Proceedings of the twelfth language resources and evaluation conference, 6193–6202. The European Language Resources Association (ELRA), Marseille, France.
5. Chiril, Patricia, Farah Benamara, Véronique Moriceau, Marlene Coulomb-Gully & Abhishek Kumar. 2019. Multilingual and multitarget hate speech detection in tweets. In Conférence sur le traitement automatique des langues naturelles (TALN-PFIA 2019), 351–360. Toulouse, France, ATALA.