Abstract
Introduction: Analyzing the high-dimensional datasets used for intrusion detection becomes a challenge for researchers. This paper presents the most often used data sets. ADFA contains two data sets containing records from Linux/Unix. AWID is based on actual traces of normal and intrusion activity of an IEEE 802.11 Wi-Fi network. CAIDA collects data types in geographically and topologically diverse regions. In CIC-IDS2017, HTTP, HTTPS, FTP, SSH, and email protocols are examined. CSECIC-2018 includes abstract distribution models for applications, protocols, or lower-level network entities. DARPA contains data of network traffic. ISCX 2012 dataset has profiles on various multi-stage attacks and actual network traffic with background noise. KDD Cup '99 is a collection of data transfer from a virtual environment. Kyoto 2006+ contains records of real network traffic. It is used only for anomaly detection. NSL-KDD corrects flaws in the KDD Cup '99 caused by redundant and duplicate records. UNSW-NB-15 is derived from real normal data and the synthesized contemporary attack activities of the network traffic. Methods: This study uses both quantitative and qualitative techniques. The scientific references and publicly accessible information about given dataset are used. Results: Datasets are often simulated to meet objectives required by a particular organization. The number of real datasets are very small compared to simulated dataset. Anomaly detection is rarely used today. Conclusion: 95 The main characteristics and a comparative analysis of the data sets in terms of the date they were created, the size, the number of features, the traffic types, and the purpose are presented.
Publisher
Centre for Evaluation in Education and Science (CEON/CEES)