Affiliation:
1. University of West Florida, USA
Abstract
This work focuses on detecting and classifying attacks in network traffic using binary and multi-class Random Forest classifiers in a distributed Big Data environment built on Apache Spark. The classifiers are tested on the UNSW-NB15 dataset. Major problems in datasets of this type include high dimensionality and imbalanced data. To address high dimensionality, both Information Gain and Principal Components Analysis (PCA) were applied before training and testing with Random Forest in Apache Spark. The binary and multi-class Random Forest classifiers were compared in a distributed environment, with and without PCA, across various numbers of Spark cores and Random Forest trees, in terms of run time and statistical measures. The highest accuracy, 99.94%, was obtained by the binary classifier using 8 cores and 30 trees. This study obtained higher accuracy and a lower false alarm rate (FAR) than previously reported, with low testing times.
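To illustrate the Information Gain step mentioned in the abstract, the sketch below scores a discrete feature by how much it reduces the entropy of the class label, IG = H(label) - H(label | feature). It is a minimal, single-machine illustration in plain Python, not the authors' Spark pipeline. The function names and the toy records are assumptions made for this example, and the abstract does not say how continuous features were handled, so only discrete values are covered.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(labels) - H(labels | feature) for one discrete feature."""
    # Group the labels by the feature value they co-occur with.
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    # Conditional entropy: entropy of each group, weighted by group size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy records (not taken from UNSW-NB15).
labels = ["attack", "attack", "normal", "normal"]
features = {
    "proto": ["tcp", "tcp", "udp", "udp"],  # separates the classes perfectly
    "state": ["FIN", "CON", "FIN", "CON"],  # carries no class information
}

# Rank features from most to least informative.
ranked = sorted(features,
                key=lambda name: information_gain(features[name], labels),
                reverse=True)
```

On this toy data the perfectly separating feature scores 1 bit and the uninformative one scores 0, so `ranked` places `proto` first. In a feature-selection setting, the lowest-ranked features would be the candidates to drop before training.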