Automated Text Annotation Using a Semi-Supervised Approach with Meta Vectorizer and Machine Learning Algorithms for Hate Speech Detection

Author:

Saifullah Shoffan12ORCID,Dreżewski Rafał13ORCID,Dwiyanto Felix Andika14ORCID,Aribowo Agus Sasmito2,Fauziah Yuli2,Cahyana Nur Heri2

Affiliation:

1. Faculty of Computer Science, AGH University of Krakow, 30-059 Krakow, Poland

2. Department of Informatics, Universitas Pembangunan Nasional Veteran Yogyakarta, Yogyakarta 55283, Indonesia

3. Artificial Intelligence Research Group (AIRG), Informatics Department, Faculty of Industrial Technology, Universitas Ahmad Dahlan, Yogyakarta 55166, Indonesia

4. Department of Electrical Engineering, Universitas Negeri Malang, Malang 65145, Indonesia

Abstract

Text annotation is an essential element of the natural language processing approaches. The manual annotation process performed by humans has various drawbacks, such as subjectivity, slowness, fatigue, and possibly carelessness. In addition, annotators may annotate ambiguous data. Therefore, we have developed the concept of automated annotation to get the best annotations using several machine-learning approaches. The proposed approach is based on an ensemble algorithm of meta-learners and meta-vectorizer techniques. The approach employs a semi-supervised learning technique for automated annotation to detect hate speech. This involves leveraging various machine learning algorithms, including Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbors (KNN), and Naive Bayes (NB), in conjunction with Word2Vec and TF-IDF text extraction methods. The annotation process is performed using 13,169 Indonesian YouTube comments data. The proposed model used a Stemming approach using data from Sastrawi and new data of 2245 words. Semi-supervised learning uses 5%, 10%, and 20% of labeled data compared to performing labeling based on 80% of the datasets. In semi-supervised learning, the model learns from the labeled data, which provides explicit information, and the unlabeled data, which offers implicit insights. This hybrid approach enables the model to generalize and make informed predictions even when limited labeled data is available (based on self-learning). Ultimately, this enhances its ability to handle real-world scenarios with scarce annotated information. In addition, the proposed method uses a variety of thresholds for matching words labeled with hate speech ranging from 0.6, 0.7, 0.8, to 0.9. The experiments indicated that the DT-TF-IDF model has the best accuracy value of 97.1% with a scenario of 5%:80%:0.9. However, several other methods have accuracy above 90%, such as SVM (TF-IDF and Word2Vec) and KNN (Word2Vec), based on both text extraction methods in several test scenarios.

Publisher

MDPI AG

Reference59 articles.

1. Alrehili, A. (2019, January 3–7). Automatic Hate Speech Detection on Social Media: A Brief Survey. Proceedings of the 2019 IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA), Abu Dhabi, United Arab Emirates.

2. Automatic hate speech detection using killer natural language processing optimizing ensemble deep learning approach;Tolba;Computing,2019

3. Rajman, M., and Besançon, R. (1998). Data Mining and Reverse Engineering, Springer.

4. A Survey on Automatic Detection of Hate Speech in Text;Fortuna;ACM Comput. Surv.,2018

5. Semi-supervised Text Annotation for Hate Speech Detection using K-Nearest Neighbors and Term Frequency-Inverse Document Frequency;Cahyana;Int. J. Adv. Comput. Sci. Appl.,2022

Cited by 1 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3