Hebrew offensive language taxonomy and dataset

Author:

Liebeskind Chaya (1), Vanetik Natalia (2), Litvak Marina (2)

Affiliation:

1. Jerusalem College of Technology, Department of Computer Science, 21 Havaad Haleumi St., Jerusalem, Israel

2. Shamoon College of Engineering, Department of Software Engineering, 56 Bialik St., Beer Sheva, Israel

Abstract

This paper introduces a streamlined taxonomy for categorizing offensive language in Hebrew, addressing a gap in a literature that has so far focused largely on Indo-European languages. The taxonomy divides offensive language into seven levels: six explicit and one implicit. It is based on the simplified offensive language (SOL) taxonomy introduced by Lewandowska-Tomaszczyk et al. (2021a), which we adjusted to Hebrew with the aim of reflecting the language's unique linguistic and cultural nuances. The study involves linguistic and cultural analysis that goes beyond Natural Language Processing (NLP); in particular, we used manual linguistic analysis to understand the nuances of offensive language in Hebrew. We describe in detail an accompanying dataset, gathered from Twitter and manually curated by human annotators. The dataset was constructed both to validate the taxonomy and to serve as a foundation for future research on offensive language detection and analysis in Hebrew. Preliminary analysis of the dataset reveals patterns and distributions that underscore the complexity and specificity of offensive expressions in Hebrew. Our aim is to capture this complexity and specificity beyond what automated NLP methods alone can provide. Our findings highlight the importance of considering linguistic and cultural variation when researching and addressing abusive language online. We believe that the taxonomy and the associated dataset will be valuable for improving research in Hebrew sociocultural studies, natural language processing, and offensive language detection. The study also contributes to research on low-resource languages and can serve as a model for future work on other languages.

Publisher

Walter de Gruyter GmbH

Subject

Linguistics and Language, Communication, Language and Linguistics

References: 46 articles.

1. Belkina, Anna C., Christopher O. Ciccolella, Rina Anno, Richard Halpert, Josef Spidlen & Jennifer E. Snyder-Cappione. 2019. Automated optimized parameters for t-distributed stochastic neighbor embedding improve visualization and analysis of large datasets. Nature Communications 10(1). 5415.

2. Bojanowski, Piotr, Edouard Grave, Armand Joulin & Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5. 135–146.

3. Bright, J. 2022. History under attack: Holocaust denial and distortion on social media. Supporting Data. United Nations Educational, Scientific and Cultural Organization (UNESCO), Paris, France, and the United Nations Department of Global Communications, United Nations, New York, USA.

4. Caselli, Tommaso, Valerio Basile, Jelena Mitrovic, Inga Kartoziya & Michael Granitzer. 2020. I feel offended, don’t be abusive! Implicit/explicit messages in offensive and abusive language. In Proceedings of the twelfth language resources and evaluation conference, 6193–6202. The European Language Resources Association (ELRA), Marseille, France.

5. Chiril, Patricia, Farah Benamara, Véronique Moriceau, Marlene Coulomb-Gully & Abhishek Kumar. 2019. Multilingual and multitarget hate speech detection in tweets. In Conférence sur le traitement automatique des langues naturelles (TALN-PFIA 2019), 351–360. Toulouse, France, ATALA.
