Abstract
As social media use booms, abusive online practices such as hate speech have unfortunately increased as well. Social media messages often contain words with repeated letters (e.g., "soooo"); such out-of-vocabulary (OOV) words should be normalized to improve the efficacy of hate speech detection. Although multiple models have attempted to normalize OOV words with repeated letters, they often fail to determine whether the in-vocabulary (IV) replacement words are correct. This study therefore developed an improved model for normalizing OOV words with repeated letters by replacing them with the correct IV words. The improved normalization model is unsupervised and requires neither a special dictionary nor annotated data. It combines rule-based patterns for words with repeated letters with the SymSpell spelling correction algorithm, removing repeated letters through multiple rules that account for both the position of the repetition within a word (beginning, middle, or end) and the repetition pattern. Two hate speech datasets were then used to assess performance. The proposed normalization model reduced the percentage of OOV words to 8%, and its F1 score was 9% and 13% higher than the models proposed in two extant studies. The proposed normalization model thus outperformed the benchmark studies in replacing OOV words with the correct IV words and improved the performance of the detection model. These results show that suitable rule-based patterns can be combined with spelling correction to build a text normalization model that correctly replaces words with repeated letters, which would, in turn, improve hate speech detection in texts.
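The core idea in the abstract, collapsing repeated-letter runs by rule and accepting the first candidate that is a valid in-vocabulary word, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the small `VOCAB` set is a hypothetical stand-in for SymSpell's frequency dictionary, and the fallback comment marks where the full SymSpell edit-distance lookup would run.

```python
import re

# Hypothetical mini-vocabulary standing in for SymSpell's frequency
# dictionary; the paper's model uses the full SymSpell algorithm.
VOCAB = {"cool", "so", "hello", "really"}

def collapse_repeats(word, keep=1):
    # Collapse any run of 3+ identical letters down to `keep` letters.
    return re.sub(r"(.)\1{2,}", r"\1" * keep, word)

def normalize(word):
    """Reduce repeated letters, preferring an in-vocabulary (IV) result."""
    if word in VOCAB:
        return word
    # Rule order: first try keeping a doubled letter ("coool" -> "cool"),
    # then a single letter ("soooo" -> "so").
    for keep in (2, 1):
        candidate = collapse_repeats(word, keep)
        if candidate in VOCAB:
            return candidate
    # Fallback: a complete system would pass the collapsed form to
    # SymSpell for edit-distance correction at this point.
    return collapse_repeats(word, 1)

print(normalize("coooool"))   # -> "cool"
print(normalize("soooo"))     # -> "so"
print(normalize("helloooo"))  # -> "hello"
```

Trying the doubled-letter candidate before the single-letter one matters because English legitimately contains double letters, so collapsing straight to one letter would break words like "cool" or "hello".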
Funder
Ministry of Higher Education and Scientific Research
Publisher
Public Library of Science (PLoS)