Multi-label text classification on unbalanced Twitter with monolingual model and hyperparameter optimization for hate speech and abusive language detection

Published: 2024-05 Issue: 5 Volume: 11 Page: 177-185
ISSN: 2313-626X
Container-title: International Journal of Advanced and Applied Sciences
Short-container-title: Int. j. adv. appl. sci.

Author:

Alzahrani Ahmad A., Bramantoro Arif, Permana Asep

Abstract

The increase in hate speech and abusive language on social media leads to uncomfortable interactions among users. Many publicly available datasets that address hate speech and abusive language are unbalanced, particularly those from Indonesian Twitter. To develop a more effective classification model that also accounts for minority classes, we optimized the hyperparameters of a monolingual model, compared four data preprocessing scenarios, and improved the treatment of slang words. We assessed the model's effectiveness by its accuracy, which reached 81.38%. This result came from optimizing hyperparameters, processing data without stemming and removing stop words, and enhancing the slang word data. The optimal hyperparameters were a learning rate of 4e-5, a batch size of 16, and a dropout rate of 0.1. However, too much dropout can decrease the model's performance and its ability to predict less common categories, such as physical- and gender-related hate speech.
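
As a rough illustration of the two technical ideas in the abstract, the sketch below enumerates a small hyperparameter grid and applies an independent sigmoid threshold per label, which is a common way to let one tweet carry several labels at once. It is not the authors' implementation. Only the reported optimum (learning rate 4e-5, batch size 16, dropout 0.1) comes from the abstract; the other grid values, the label names, the 0.5 threshold, and the example logits are assumptions made for illustration.

```python
import math
from itertools import product

# Optimum reported in the abstract.
REPORTED_BEST = {"learning_rate": 4e-5, "batch_size": 16, "dropout": 0.1}

# Hypothetical search grid: only the values in REPORTED_BEST come from the
# abstract; every other candidate value is an assumption.
GRID = {
    "learning_rate": [2e-5, 3e-5, 4e-5, 5e-5],
    "batch_size": [16, 32],
    "dropout": [0.1, 0.3, 0.5],
}


def grid_configs(grid):
    """Yield every combination of hyperparameter values in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def sigmoid(x):
    """Map a raw model score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, label_names, threshold=0.5):
    """Multi-label decision: each label is thresholded independently,
    so a single text can receive zero, one, or several labels."""
    return [
        name
        for name, z in zip(label_names, logits)
        if sigmoid(z) >= threshold
    ]


if __name__ == "__main__":
    configs = list(grid_configs(GRID))
    print(len(configs))              # 24 candidate configurations
    print(REPORTED_BEST in configs)  # True

    # Hypothetical label names and made-up logits for one tweet.
    labels = ["hate_speech", "abusive", "physical", "gender"]
    print(predict_labels([2.0, 0.3, -1.5, -3.0], labels))
    # ['hate_speech', 'abusive']
```

In a real experiment each configuration from `grid_configs` would be used to fine-tune and evaluate the model; that step is omitted here because the abstract does not describe the training loop.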

Publisher

International Journal of Advanced and Applied Sciences

Link

https://science-gate.com/IJAAS/Articles/2024/2024-11-05/1021833ijaas202405019.pdf
