Abusive Language Detection in Khasi Social Media Comments

Author:

Baruah Arup (1, 2), Wahlang Lakhamti (2), Jyrwa Firstbornson (2), Shadap Floriginia (2), Barbhuiya Ferdous (1), Dey Kuntal (3, 1)

Affiliation:

1. Computer Science and Engineering, Indian Institute of Information Technology Guwahati, Guwahati, India

2. Computer Science and Engineering, Assam Don Bosco University, Guwahati, India

3. Accenture Labs, Bangalore, India

Abstract

This paper describes work on automated abusive language detection in Khasi, a low-resource language spoken primarily in the state of Meghalaya, India. A dataset named the Khasi Abusive Language Dataset (KALD) was created, consisting of 4,573 human-annotated Khasi YouTube and Facebook comments. A corpus of Khasi text was also built and used to train Khasi word2vec and fastText word embeddings. Deep learning, traditional machine learning, and ensemble models were used in the study. Experiments were performed using word2vec, fastText, and topic vectors obtained using LDA. Further experiments checked whether the zero-shot cross-lingual nature of language models such as LaBSE and LASER can be utilized for abusive language detection in Khasi. The best F1 score of 0.90725 was obtained by an XGBoost classifier. After feature selection and rebalancing of the dataset, F1 scores of 0.91828 and 0.91945 were obtained by SVM-based classifiers.
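Since the results above are reported as F1 scores, a minimal sketch of how such a score can be computed from gold and predicted labels may help in reading them. The abstract does not say which averaging was used (binary, macro, or weighted), so the macro average below is an assumption, as are the label names, the function names, and the toy data, none of which come from KALD.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} from parallel lists of gold and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1   # correct prediction for this class
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[gold] += 1   # gold class gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels invented for illustration only; they are not taken from KALD.
gold = ["abusive", "abusive", "abusive", "non-abusive", "non-abusive", "non-abusive"]
pred = ["abusive", "abusive", "non-abusive", "non-abusive", "non-abusive", "abusive"]

per_class = f1_per_class(gold, pred)
overall = macro_f1(gold, pred)
```

With this toy data each class has two true positives, one false positive, and one false negative, so both per-class scores and the macro average equal 2/3. On an imbalanced dataset the per-class scores would differ, which is one reason the choice of averaging matters when comparing reported numbers.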

Publisher

Association for Computing Machinery (ACM)

