Affiliation:
1. Department of Urology, Aerospace Center Hospital, Beijing, China
2. School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
3. Department of Urology, The Second Hospital of Tianjin Medical University, Tianjin, China
Abstract
Feature selection plays a crucial role in classification tasks as part of data preprocessing. Effective feature selection can improve the robustness and interpretability of learning algorithms and accelerate model training. However, traditional statistical methods for feature selection are no longer practical for high-dimensional data because of their computational complexity. Ensemble learning, a prominent approach in machine learning, has demonstrated exceptional performance, particularly in classification problems. To address this issue, we propose a three-stage feature selection framework for high-dimensional data based on ensemble learning (EFS-GINI). First, highly correlated features are eliminated using the Spearman correlation coefficient. Then, in the first stage, a feature selector based on the F-test is applied. In the second stage, four feature subsets are formed in parallel using mutual information (MI), ReliefF, SURF, and SURF* filters. In the third stage, features are selected using a combiner based on the Gini coefficient. Finally, a soft voting approach is employed for classification, combining decision tree, naive Bayes, support vector machine (SVM), k-nearest neighbors (KNN), and random forest classifiers. To demonstrate the effectiveness and efficiency of the proposed algorithm, we evaluate it on eight high-dimensional datasets and compare it with five feature selection methods. Experimental results show that our method effectively enhances the accuracy and speed of feature selection. Moreover, to explore the biological significance of the proposed algorithm, we apply it to the renal cell carcinoma dataset GSE40435 from the Gene Expression Omnibus database. Two feature genes, NOP2 and NSUN5, are selected by our proposed algorithm. Both are directly involved in regulating m5C RNA modification, which indicates the biological relevance of EFS-GINI.
Through bioinformatics analysis, we show that m5C-related genes play an important role in the occurrence and progression of renal cell carcinoma and are expected to become important markers for predicting patient prognosis.
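Two of the steps named in the abstract, the Spearman-based redundancy filter and the soft-voting classifier, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the 0.9 correlation threshold, the greedy keep-first ordering, and the equal classifier weights are all assumptions made for illustration; the abstract does not specify them.

```python
from statistics import mean


def ranks(values):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(columns, threshold=0.9):
    """Greedy filter (assumed rule): keep a feature only if its |rho| with
    every already-kept feature does not exceed the threshold."""
    kept = []
    for name, col in columns.items():
        if all(abs(spearman(col, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def soft_vote(prob_rows):
    """Average per-class probabilities from several classifiers for one sample
    (equal weights assumed) and return (predicted class index, averaged vector)."""
    n_classes = len(prob_rows[0])
    avg = [sum(p[c] for p in prob_rows) / len(prob_rows) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg


# Hypothetical toy data: three features measured on five samples.
features = {"g1": [1, 2, 3, 4, 5], "g2": [1, 4, 9, 16, 25], "g3": [3, 1, 4, 1, 5]}
print(drop_correlated(features))

# Hypothetical class-probability outputs of three classifiers for one sample.
print(soft_vote([[0.6, 0.4], [0.4, 0.6], [0.1, 0.9]]))
```

In the toy data, g2 is a monotonic transform of g1, so the two have a Spearman coefficient of 1 and the filter treats g2 as redundant. Soft voting differs from majority voting in that a single confident classifier can outweigh a less confident one, because probabilities are averaged rather than labels counted.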
Funder
National Natural Science Foundation of China