Affiliation:
1. Department of Computer Science and Engineering, College of Engineering, American University of Sharjah, Sharjah P.O. Box 26666, United Arab Emirates
2. Department of Electrical Engineering, Canadian University of Dubai, Dubai P.O. Box 117781, United Arab Emirates
Abstract
Variable (feature) selection plays an important role in data analysis and mathematical modeling. This paper addresses the lack of formal evaluation benchmarks for feature selection algorithms (FSAs). Evaluating FSAs effectively requires controlled environments, and synthetic datasets offer significant advantages because the role of every feature is known in advance. We introduce a set of ten synthetically generated datasets with known relevant, redundant, and irrelevant features, derived from various mathematical, logical, and geometric sources. Additionally, eight FSAs, selected for their relevance and novelty, are evaluated on these datasets. The paper first introduces the datasets and then provides a comprehensive experimental analysis of the performance of the selected FSAs on them, including tests of the FSAs' resilience to two types of induced data noise. The analysis guided the grouping of the generated datasets into four groups by data complexity. Lastly, we provide public access to the generated datasets via our GitHub repository to facilitate benchmarking of new feature selection algorithms. The contributions of this paper aim to foster the development of novel feature selection algorithms and advance their study.
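As an illustration of what "known relevance, redundancy, and irrelevance" can mean in practice, the following is a minimal sketch and not one of the paper's ten datasets. It assumes a toy XOR target; the names `make_xor_dataset` and `selection_score`, the label-flipping noise, and the scoring rule are all hypothetical choices made for this example.

```python
import random

def make_xor_dataset(n_samples=200, n_irrelevant=3, label_noise=0.0, seed=0):
    """Build a toy binary dataset whose feature roles are known by construction.

    Columns: 0 and 1 are relevant (label = x0 XOR x1), 2 is a redundant
    copy of column 0, and the remaining columns are irrelevant random bits.
    label_noise is the probability of flipping each label (an assumed
    noise model, chosen only for illustration).
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        row = [a, b, a] + [rng.randint(0, 1) for _ in range(n_irrelevant)]
        label = a ^ b
        if rng.random() < label_noise:
            label = 1 - label  # induced label noise
        X.append(row)
        y.append(label)
    roles = ["relevant", "relevant", "redundant"] + ["irrelevant"] * n_irrelevant
    return X, y, roles

def selection_score(selected, roles):
    """Return (recall of relevant features, number of non-relevant picks).

    Simplification: a redundant copy counts as a wrong pick here, although
    a fuller benchmark might treat it as interchangeable with its source.
    """
    relevant = {i for i, role in enumerate(roles) if role == "relevant"}
    chosen = set(selected)
    recall = len(chosen & relevant) / len(relevant)
    false_picks = len(chosen - relevant)
    return recall, false_picks
```

Because the roles are fixed when the data is generated, the output of any selector can be scored directly against them, for example `selection_score([0, 3], roles)` for a selector that picked columns 0 and 3.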
Funder
Open Access Program from the American University of Sharjah