Comprehensive empirical investigation for prioritizing the pipeline of using feature selection and data resampling techniques

Published: 2024-03-05 Issue: 3 Volume: 46 Page: 6019-6040
ISSN: 1064-1246
Container-title: Journal of Intelligent & Fuzzy Systems
Short-container-title: IFS

Author:

Tyagi Pooja¹, Singh Jaspreeti¹, Gosain Anjana¹

Affiliation:

1. University School of Information, Communication & Technology, Guru Gobind Singh Indraprastha University, Dwarka, New Delhi, India

Abstract

Contemporary real-world datasets often suffer from both class imbalance and high dimensionality. Data resampling is a common remedy for class imbalance, whereas feature selection is used to tackle high dimensionality. These two problems have been studied extensively in isolation, but the possible synergy between them remains unclear. This paper studies the effect of addressing both issues together by combining resampling and feature selection techniques for binary-class imbalanced classification. In particular, the primary goal is to prioritize the sequence, or pipeline, in which these techniques are applied by analyzing the performance of the two opposite pipelines: feature selection before resampling (F + S) and resampling before feature selection (S + F). To this end, a comprehensive empirical study is carried out, comprising a total of 34,560 tests on 30 publicly available datasets using combinations of 12 resampling techniques for class imbalance and 12 feature selection methods, with performance evaluated on 4 different classifiers. The experiments show that neither pipeline is consistently better than the other, and that both should be considered to obtain the best classification results on high-dimensional imbalanced data. Additionally, with Decision Tree (DT) or Random Forest (RF) as the base learner, S + F predominates over F + S, whereas with Support Vector Machine (SVM) and Logistic Regression (LR), F + S outperforms S + F in most cases. According to the mean rankings obtained from the Friedman test, the best combinations of resampling and feature selection techniques for DT, SVM, LR and RF are, respectively, SMOTE + RFE (Synthetic Minority Oversampling Technique and Recursive Feature Elimination), LASSO (Least Absolute Shrinkage and Selection Operator) + SMOTE, SMOTE + embedded feature selection using RF, and SMOTE + RFE.
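
The two pipeline orders and the size of the experimental grid can be illustrated with a minimal Python sketch. This is not the authors' code: the random oversampler and the mean-gap feature filter are toy stand-ins chosen so that the example needs only the standard library, and every function name here is hypothetical. The factor of 2 for pipeline order is an inference from the abstract's figures (30 × 12 × 12 × 4 × 2 = 34,560), not something the abstract states explicitly.

```python
import random
from itertools import product

# Counts taken from the abstract. Treating pipeline order as a fifth factor
# is an assumption that happens to reproduce the reported total.
N_DATASETS, N_RESAMPLERS, N_SELECTORS, N_CLASSIFIERS = 30, 12, 12, 4
ORDERS = ("F+S", "S+F")

grid = list(product(range(N_DATASETS), range(N_RESAMPLERS),
                    range(N_SELECTORS), range(N_CLASSIFIERS), ORDERS))


def random_oversample(X, y, rng):
    """Toy resampler: duplicate random rows of smaller classes until balanced."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def top_k_by_mean_gap(X, y, k):
    """Toy filter: keep the k features with the largest gap between class means.

    Assumes binary labels 0 and 1, both present in y.
    """
    def class_mean(j, label):
        vals = [row[j] for row, lab in zip(X, y) if lab == label]
        return sum(vals) / len(vals)

    n_features = len(X[0])
    scores = [abs(class_mean(j, 1) - class_mean(j, 0)) for j in range(n_features)]
    best = sorted(range(n_features), key=lambda j: -scores[j])[:k]
    return sorted(best)


def project(X, cols):
    """Keep only the selected feature columns."""
    return [[row[j] for j in cols] for row in X]


def run_pipeline(X, y, order, k, seed=0):
    """Apply selection and resampling in the requested order."""
    rng = random.Random(seed)
    if order == "F+S":    # feature selection first, then resampling
        cols = top_k_by_mean_gap(X, y, k)
        X_out, y_out = random_oversample(project(X, cols), y, rng)
    elif order == "S+F":  # resampling first, then feature selection
        X_res, y_out = random_oversample(X, y, rng)
        cols = top_k_by_mean_gap(X_res, y_out, k)
        X_out = project(X_res, cols)
    else:
        raise ValueError(f"unknown order: {order}")
    return X_out, y_out, cols


# Tiny imbalanced toy dataset (4 majority rows, 1 minority row), made up for illustration.
X_demo = [[0.1, 5.0, 1.0], [0.2, 5.1, 0.9], [0.0, 4.9, 1.1],
          [0.1, 5.0, 1.0], [0.9, 5.0, 3.0]]
y_demo = [0, 0, 0, 0, 1]

print(len(grid))  # number of (dataset, resampler, selector, classifier, order) runs
for order in ORDERS:
    _, y_bal, cols = run_pipeline(X_demo, y_demo, order, k=2)
    print(order, "selected columns:", cols, "class counts:", y_bal.count(0), y_bal.count(1))
```

In F + S the selector scores features on the original, imbalanced data, while in S + F it scores them on the rebalanced data, so with real resamplers and selectors the two orders can pick different features. That difference is what the study compares.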

Publisher

IOS Press

Reference: 63 articles.

1. Empirical comparisons for combining balancing and feature selection strategies for characterizing football players using FIFA video game system; Al-Asadi; IEEE Access, 2021

2. DBFS: An effective Density Based Feature Selection scheme for small sample size and high dimensional imbalanced data sets; Alibeigi; Data & Knowledge Engineering, 2012

3. Comparing oversampling techniques to handle the class imbalance problem: A customer churn prediction case study; Amin; IEEE Access, 2016

4. Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection; Ang; IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015

5. A study of the behavior of several methods for balancing machine learning training data; Batista; ACM SIGKDD Explorations Newsletter, 2004