Pradvis vac: A socio-demographic dataset for determining the level of hatred severity in a low-resource Hinglish language

Authors:

Biradar Shankar (1), Saumya Sunil (2), Kumar Abhinav (3), Singh Ashish (4)

Affiliation:

1. Department of CSE, IIIT Dharwad, India

2. Department of Data Science and Intelligent System, IIIT Dharwad, India

3. Department of CSE, IIIT Surat, India

4. School of Computer Engineering, KIIT University, India

Abstract

In multilingual societies such as India, mixing a native language with English has become common in social media conversations. Moreover, the government's digitization push is bringing more people from rural India onto social media platforms, resulting in exponential growth of native-language and code-mixed content. This content carries both positive sentiment (also termed Hope Speech) and negative sentiment (also termed Hate Speech). To keep social media clean and free of hate, it is important to remove negative content using machine learning filters. Because most existing hate content prediction models are trained on high-resource languages such as English, they fail on code-mixed text, with its spelling variation and non-grammatical structure. The lack of suitable training data may be another reason for the poor performance of existing models on code-mixed text. To address these issues and promote research in this direction, we developed a manually annotated Hinglish code-mixed corpus of 9254 comments taken from Twitter handles. We also annotated the data with the target audience and the severity level. Each label is given a more fine-grained classification with three independent classes, yielding a multi-label, multi-class corpus for predicting the severity of hate content in Hinglish code-mixed text. To validate the proposed data, we further modeled various supervised classifiers for severity prediction. The proposed models employ transformers for feature extraction and different machine learning and RNN (recurrent neural network) models for classification. According to the experimental results, combining the target label with embeddings of the Twitter text and classifying with a BiLSTM (a variant of RNN) performed better on severity prediction, attaining an acceptable weighted F1 score.
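The two ideas at the end of the abstract, appending the target label to the text embedding and scoring with a weighted F1, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the class names ("individual", "group", "other", "low", "medium", "high"), and the toy two-dimensional embedding are all hypothetical, since the abstract names neither the classes nor the embedding size. The weighted F1 shown here is the standard definition, per-class F1 averaged with weights proportional to each class's support.

```python
from collections import Counter


def one_hot(label, classes):
    """Encode a categorical label as a one-hot list over `classes`."""
    return [1.0 if label == c else 0.0 for c in classes]


def combine_features(text_embedding, target_label, target_classes):
    """Append the one-hot target label to a text embedding.

    Assumption: plain concatenation; the abstract does not say how the
    target label and the embeddings are combined.
    """
    return list(text_embedding) + one_hot(target_label, target_classes)


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical usage with made-up class names and a toy embedding.
TARGET_CLASSES = ["individual", "group", "other"]
features = combine_features([0.1, 0.2], "group", TARGET_CLASSES)

gold = ["low", "low", "high", "medium"]
pred = ["low", "high", "high", "medium"]
score = weighted_f1(gold, pred)
```

In a real pipeline the embedding would come from a transformer encoder and the combined vector would feed the classifier; the sketch only shows the feature concatenation and the evaluation metric.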

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

