Pradvis vac: A socio-demographic dataset for determining the level of hatred severity in a low-resource Hinglish language

Published: 2022-12-07
ISSN:2375-4699
Container-title:ACM Transactions on Asian and Low-Resource Language Information Processing
language:en
Short-container-title:ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Biradar Shankar¹, Saumya Sunil², Kumar Abhinav³, Singh Ashish⁴

Affiliation:

1. Department of CSE, IIIT Dharwad, India

2. Department of Data Science and Intelligent System, IIIT Dharwad, India

3. Department of CSE, IIIT Surat, India

4. School of Computer Engineering, KIIT University, India

Abstract

In multilingual societies like India, mixing the native language with English has become common in social media conversations. Further, due to the government’s digitization push, more people from rural India are joining social media platforms, resulting in exponential growth of native-language and code-mixed content. This content appears in both positive contexts (also termed Hope Speech) and negative contexts (also termed Hate Speech). To keep social media clean and hate-free, it is important to remove the negative content using machine learning filters. Since most existing hate content prediction models are trained on high-resource languages such as English, they fail on code-mixed text because of its spelling variance and non-grammatical structure. In addition, the lack of suitable training data could be one reason behind existing models’ poor performance on code-mixed text. To address these issues and promote research in this direction, we developed a manually annotated Hinglish code-mixed corpus of 9254 comments taken from Twitter handles. We also annotated our data with the target audience and severity level. For each label, we provided a more fine-grained classification with three independent classes, yielding a multi-label, multi-class corpus for predicting the severity of hate content in Hinglish code-mixed text. Further, to validate the proposed data, we modeled various supervised classifiers for severity prediction. The proposed models employ transformers for feature extraction and different machine learning and RNN (recurrent neural network) models for classification. According to the experimental results, the target label combined with embeddings from Twitter text, using the BiLSTM (a variant of RNN) classifier, performed better on severity prediction, attaining an acceptable weighted F1 score.
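The abstract reports results as a weighted F1 score without stating the formula. The following is a minimal sketch of how a support-weighted F1 is commonly computed for a three-class task. The class names ("low", "medium", "high") and the sample labels are hypothetical and are not taken from the dataset, and the paper's own evaluation code may differ.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its
    share of the true labels (its support)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (count / total) * f1
    return score


# Hypothetical severity labels, for illustration only.
y_true = ["low", "low", "medium", "high"]
y_pred = ["low", "medium", "medium", "high"]
print(weighted_f1(y_true, y_pred))
```

Weighting by support means that frequent classes dominate the score, which matters when severity classes are imbalanced.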

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3573199

References: 48 articles.

1. Swati Agarwal and Ashish Sureka. 2017. Characterizing linguistic attributes for automatic classification of intent based racist/radicalized posts on Tumblr micro-blogging website. arXiv preprint arXiv:1701.04931 (2017).

2. Deep Learning for Hate Speech Detection in Tweets

3. Nitin Nikamanth Appiah Balaji and B Bharathi. 2020. SSNCSE_NLP@HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification on Multilingual Code Mixing Text. In FIRE (Working Notes). 370–376.

4. Mohit Bhardwaj, Md Shad Akhtar, Asif Ekbal, Amitava Das, and Tanmoy Chakraborty. 2020. Hostility detection dataset in Hindi. arXiv preprint arXiv:2011.03588 (2020).

5. Shankar Biradar, Sunil Saumya, et al. 2022. Fighting hate speech from bilingual Hinglish speaker’s perspective, a transformer- and translation-based approach. Social Network Analysis and Mining 12, 1 (2022), 1–10.

Cited by 3 articles.

1. Domain Specific Embeddings in RNN Frameworks for Hate Span Detection and Classification; 2023 4th International Conference on Intelligent Technologies (CONIT); 2024-06-21

2. Faux Hate: unravelling the web of fake narratives in spreading hateful stories: a multi-label and multi-class dataset in cross-lingual Hindi-English code-mixed text; Language Resources and Evaluation; 2024-04-16

3. Explainable Deep Learning for Mental Health Detection from English and Arabic Social Media Posts; ACM Transactions on Asian and Low-Resource Language Information Processing; 2023-11-21