Authors:
Ma Zhengchi, Ouyang Ruoyu, Wang Hanzhang
Abstract
This study investigated how well the Random Forest algorithm performs at spam detection when a model trained on email spam is generalized to social media comment spam. Two data sources were used: an email dataset and a YouTube spam comment dataset. Both were preprocessed with text processing and feature extraction methods from the scikit-learn package, and the labels "spam" and "ham" were mapped to 1 and 0, respectively, for training and testing. The email spam dataset was split into training and testing sets, with the first 3000 lines used to train the model, and the model's generalization ability was then tested on the YouTube spam comment dataset. The Random Forest algorithm builds multiple decision trees, each trained on a different subset of the training data. Prediction accuracy on the YouTube spam comment dataset was only around 62%, which is comparatively low and suggests that Random Forest, when used for spam detection, may not generalize well enough to be applied in practice. In addition, the maximum accuracy decreased as the number of trees increased, which may indicate overfitting. Given the modest accuracy of the models, one possible improvement is to adjust the data preprocessing so that the features extracted from the text better match the characteristics of social media spam. In conclusion, further work is needed before the model can be used in generalized settings.
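The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the toy rows stand in for the email and YouTube datasets, TF-IDF is an assumed choice of feature extractor (the abstract says only that scikit-learn feature extraction was used), and the names `LABEL_MAP`, `TRAIN_ROWS`, `split_labels_and_text`, and `cross_domain_accuracy` are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Toy stand-ins for the two corpora (hypothetical rows, not the study's data).
email_rows = [
    ("spam", "Win a free prize now, click here"),
    ("ham", "Meeting moved to 3pm tomorrow"),
    ("spam", "Cheap loans approved instantly, click now"),
    ("ham", "Please review the attached report"),
    ("spam", "Free money, claim your prize today"),
    ("ham", "Lunch on Friday with the team?"),
]
youtube_rows = [
    ("spam", "Check out my channel for free gift cards"),
    ("ham", "This song brings back memories"),
    ("spam", "Subscribe to me and click this link now"),
    ("ham", "Great video, thanks for sharing"),
]

LABEL_MAP = {"spam": 1, "ham": 0}  # label mapping stated in the abstract
TRAIN_ROWS = 3000                  # the abstract trains on the first 3000 lines


def split_labels_and_text(rows):
    """Turn (label, text) pairs into a list of 0/1 labels and a list of texts."""
    labels = [LABEL_MAP[label] for label, _ in rows]
    texts = [text for _, text in rows]
    return labels, texts


train_labels, train_texts = split_labels_and_text(email_rows[:TRAIN_ROWS])
test_labels, test_texts = split_labels_and_text(youtube_rows)

# Assumption: TF-IDF features. The vocabulary is fitted on email text only,
# so YouTube comments are encoded with words the model saw during training.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)


def cross_domain_accuracy(n_trees):
    """Train on email features and score on YouTube comment features."""
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_train, train_labels)
    return accuracy_score(test_labels, model.predict(X_test))


# Compare a few forest sizes, mirroring the abstract's look at tree count.
accuracies = {n: cross_domain_accuracy(n) for n in (10, 50, 100)}
for n, acc in accuracies.items():
    print(f"{n} trees: accuracy on YouTube comments = {acc:.2f}")
```

Fitting the vectorizer on the email text alone is what makes this a test of generalization: any vocabulary specific to YouTube comments is invisible to the model, which is one plausible reason for a drop in cross-domain accuracy.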
Publisher
Darcy & Roy Press Co. Ltd.