Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter

Published:2020-12-08 Issue:2 Volume:32
ISSN:2313-7835
Container-title:South African Computer Journal
Short-container-title:SACJ

Author:

Oriola Oluwafemi, Kotzé Eduan

Abstract

Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, existing semi-supervised learning methods have been skewed towards small amounts of labelled data with a small feature space. This paper, therefore, presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Networks is evaluated. The proposed technique, with accuracy and F1-score of 0.97 and 0.95, respectively, outperforms existing semi-supervised learning techniques. The learning curves show that the proposed technique used the training data more efficiently than existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier record the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.
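
The abstract does not give the algorithm in detail, so the following is only a rough sketch of one way cluster-based majority voting over several feature sets could work. It assumes the tweets have already been clustered once per feature set (for example with k-means on word n-grams, character n-grams and embeddings), that each cluster takes the majority label of its labelled members, and that each unlabelled tweet then takes the majority of its per-feature-set cluster labels. The function names `majority` and `pseudo_label`, the tie handling and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority(values):
    """Return the most common value, or None if there are no values or a tie."""
    top = Counter(values).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # tie: abstain rather than guess
    return top[0][0]


def pseudo_label(views, labels):
    """Assign labels to unlabelled items by voting across feature sets.

    views  : list of cluster-id lists, one list per feature set
             (assumed to come from a prior clustering step).
    labels : list of known labels, with None for unlabelled items.
    Returns a new label list; items that cannot be decided stay None.
    """
    votes = [[] for _ in labels]
    for cluster_ids in views:
        # Collect the known labels that fall inside each cluster.
        members = {}
        for cid, lab in zip(cluster_ids, labels):
            if lab is not None:
                members.setdefault(cid, []).append(lab)
        # A cluster takes the majority label of its labelled members.
        cluster_label = {cid: majority(labs) for cid, labs in members.items()}
        # Each unlabelled item receives one vote from its cluster.
        for i, cid in enumerate(cluster_ids):
            if labels[i] is None and cluster_label.get(cid) is not None:
                votes[i].append(cluster_label[cid])
    # Final label: majority over the votes from all feature sets.
    return [lab if lab is not None else majority(votes[i])
            for i, lab in enumerate(labels)]


# Toy data: 1 = abusive, 0 = not abusive, None = unlabelled.
labels = [1, 1, 0, 0, None, None]
views = [
    [0, 0, 1, 1, 0, 1],  # clusters from a hypothetical word n-gram view
    [0, 0, 1, 1, 0, 0],  # clusters from a hypothetical char n-gram view
    [0, 0, 1, 1, 1, 1],  # clusters from a hypothetical embedding view
]
print(pseudo_label(views, labels))
```

In this sketch an unlabelled tweet stays unlabelled when the feature sets disagree evenly, which keeps uncertain pseudo-labels out of the training data. Whether the paper handles ties this way is not stated in the abstract.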

Publisher

South African Institute of Computer Scientists and Information Technologists

Subject

Computer Networks and Communications, Computer Science Applications, Human-Computer Interaction, Education, Information Systems

Cited by 3 articles.

1. Improving the Detection of Multilingual South African Abusive Language via Skip-gram using Joint Multilevel Domain Adaptation;ACM Transactions on Asian and Low-Resource Language Information Processing;2023-12-28

2. Cyberbullying detection for low-resource languages and dialects: Review of the state of the art;Information Processing & Management;2023-09

3. FALCoN: Detecting and classifying abusive language in social networks using context features and unlabeled data;Information Processing & Management;2023-07