Affiliation:
1. Department of Computer Science, University of Engineering and Technology, Lahore P.O. Box 54890, Pakistan
2. Artificial Intelligence and Data Analytics Laboratory, College of Computer and Information Sciences (CCIS), Prince Sultan University, Riyadh P.O. Box 66833, Saudi Arabia
Abstract
Spam from advertisements and from social media platforms such as Facebook, Twitter, and Instagram is increasing, which has drawn growing attention to spam detection. Spam review identification has been studied in many languages, including Chinese, Urdu, Roman Urdu, English, and Turkish; however, fewer high-quality datasets are available for Urdu. This is mainly because Urdu is less widely used on social media networks such as Twitter, making it harder to collect large volumes of relevant data. This paper investigates policy-based spam detection in Urdu tweets. The study aims to collect over 1,100,000 real-time tweets from multiple users. The dataset is carefully filtered to comply with Twitter’s 100-tweet-per-hour limit. For data collection, the snscrape library is used, which provides an API for accessing attributes such as username, URL, and tweet content. A machine learning pipeline is then developed, consisting of TF-IDF and Count Vectorizer features and the following classifiers: multinomial naïve Bayes, a support vector classifier with an RBF kernel, logistic regression, and BERT. Features are extracted according to Twitter policy standards, and the dataset is split into training and testing sets for spam analysis. Experimental results show that the logistic regression classifier achieves the highest accuracy (99.55%), with an F1-score of 0.70. The findings demonstrate the effectiveness of policy-based spam detection in Urdu tweets using machine learning and BERT layer models and contribute to the development of a robust spam detection method for Urdu-language social media.
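The abstract does not include the authors' code, so the following is only a minimal sketch of one stage it names: bag-of-words counting followed by a multinomial naïve Bayes classifier, plus the accuracy and F1 metrics it reports. It uses the Python standard library only. The class `TinyMultinomialNB`, the helper `accuracy_and_f1`, the whitespace tokenizer, and the English toy messages are all hypothetical stand-ins, not the paper's dataset, features, or implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Whitespace tokenizer; a stand-in for real Urdu preprocessing (assumption)."""
    return text.lower().split()


class TinyMultinomialNB:
    """Bag-of-words multinomial naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        # Vocabulary is the union of all words seen in training.
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            # Log prior of the class.
            score = math.log(self.doc_counts[c] / total_docs)
            for tok in tokenize(text):
                if tok in self.vocab:  # words unseen in training are ignored
                    num = self.word_counts[c][tok] + 1
                    den = total_words + len(self.vocab)
                    score += math.log(num / den)
            scores[c] = score
        return max(scores, key=scores.get)


def accuracy_and_f1(y_true, y_pred, positive="spam"):
    """Return (accuracy, F1 for the positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical toy training data (not from the paper).
train_texts = [
    "win free prize click link",
    "free followers click now",
    "meeting at noon today",
    "weather is nice today",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = TinyMultinomialNB().fit(train_texts, train_labels)

# Hypothetical imbalanced evaluation: 98 ham, 2 spam, one spam message missed.
y_true = ["ham"] * 98 + ["spam"] * 2
y_pred = ["ham"] * 99 + ["spam"]
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

The last few lines illustrate why both metrics are worth reporting: when spam is rare, missing a single spam message barely moves accuracy while the spam-class F1 drops sharply. The counts are invented for illustration and say nothing about the class balance of the paper's dataset.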
Funder
Artificial Intelligence and Data Analytics Laboratory, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia
University of Engineering and Technology (UET), Lahore
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
Cited by
5 articles.