Affiliation:
1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Abstract
Offensive language on social media degrades the social experience of individuals and groups and undermines social harmony and moral values. The problem of offensive language detection has therefore attracted considerable research attention in recent years. However, most existing work focuses on English, and few studies address Chinese. In this paper, we propose a new approach to detecting Chinese offensive language. First, unlike previous approaches, we use both the sentence-level and word-level embeddings of RoBERTa, combining them with a bidirectional GRU and a multi-head self-attention mechanism. This feature fusion lets the model consider sentence-level and word-level semantic information at the same time and thus capture the semantics of Chinese text more comprehensively. Second, by concatenating the output of the multi-head attention with RoBERTa’s sentence embedding, we efficiently fuse local and global information and improve the representational ability of the model. In our experiments, the proposed model achieved 82.931% accuracy and an 82.842% F1-score on the Chinese offensive language detection task, indicating strong performance and potential for broad application.
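The fusion step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The toy `word_features` stand in for bidirectional GRU outputs computed over RoBERTa word embeddings, and `sentence_embedding` stands in for RoBERTa's sentence-level vector. All function names are hypothetical. The attention uses no learned projections (queries, keys, and values are the inputs themselves), and mean pooling over tokens is an assumption, since the abstract does not say how the attention output is reduced before concatenation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_head(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    Assumption: no learned projection matrices, to keep the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def multi_head_self_attention(tokens, num_heads):
    """Split the feature dimension into heads, attend per head, re-concatenate."""
    d = len(tokens[0])
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    hd = d // num_heads
    heads = [
        self_attention_head([t[h * hd:(h + 1) * hd] for t in tokens])
        for h in range(num_heads)
    ]
    return [[x for head in heads for x in head[i]] for i in range(len(tokens))]


def mean_pool(tokens):
    """Average token vectors into one vector (pooling choice is an assumption)."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def fuse(word_features, sentence_embedding, num_heads=2):
    """Concatenate pooled attention output (local) with the sentence vector (global)."""
    attended = multi_head_self_attention(word_features, num_heads)
    local = mean_pool(attended)
    return local + list(sentence_embedding)


# Toy inputs: 3 tokens with 4 features each, and a 3-dimensional sentence vector.
word_features = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.1, 0.0, 0.2],
    [0.3, 0.3, 0.9, 0.1],
]
sentence_embedding = [0.7, -0.2, 0.05]

fused = fuse(word_features, sentence_embedding)
print(len(fused))  # 7 (4 word-level features + 3 sentence-level features)
```

In a full model, the fused vector would feed a classification layer; the abstract does not specify that layer, so it is left out here.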
Funder
National Natural Science Foundation of China
Natural Science Foundation of Xinjiang Uygur Autonomous Region, China
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science