Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods

Published:2024-03-26 Issue:1 Volume:11 Page:
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Wang Huanjing, Liang Qianxin, Hancock John T., Khoshgoftaar Taghi M.

Abstract

In the context of high-dimensional credit card fraud data, researchers and practitioners commonly utilize feature selection techniques to enhance the performance of fraud detection models. This study presents a comparison in model performance using the most important features selected by SHAP (SHapley Additive exPlanations) values and the model’s built-in feature importance list. Both methods rank features and choose the most significant ones for model assessment. To evaluate the effectiveness of these feature selection techniques, classification models are built using five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. The Area under the Precision-Recall Curve (AUPRC) serves as the evaluation metric. All experiments are executed on the Kaggle Credit Card Fraud Detection Dataset. The experimental outcomes and statistical tests indicate that feature selection methods based on importance values outperform those based on SHAP values across classifiers and various feature subset sizes. For models trained on larger datasets, it is recommended to use the model’s built-in feature importance list as the primary feature selection method over SHAP. This suggestion is based on the rationale that computing SHAP feature importance is a distinct activity, while models naturally provide built-in feature importance as part of the training process, requiring no additional effort. Consequently, opting for the model’s built-in feature importance list can offer a more efficient and practical approach for larger datasets and more intricate models.
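The ranking-and-selection step that both methods share, together with the AUPRC metric, can be sketched roughly as follows. This is an illustrative sketch and not the authors' code. The feature names `f1` to `f4`, every numeric value, and the helper functions are hypothetical. In practice the built-in scores would come from a fitted tree model and the per-sample attributions from a SHAP explainer. The sketch assumes that SHAP values are turned into a global ranking by averaging their absolute values over samples, and that AUPRC is approximated by step-wise average precision with no tied scores.

```python
from typing import Dict, List, Sequence


def top_k_features(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the largest scores, highest first."""
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:k]


def mean_abs_attribution(rows: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Collapse per-sample attributions (such as SHAP values) into one
    global score per feature by averaging their absolute values."""
    names = list(rows[0].keys())
    return {n: sum(abs(r[n]) for r in rows) / len(rows) for n in names}


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Step-wise approximation of the area under the precision-recall curve.

    Assumes binary labels (0 or 1) and no tied scores.
    """
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    positives = sum(y_true)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / positives if positives else 0.0


# Hypothetical built-in importance scores (placeholder values).
builtin_scores = {"f1": 0.40, "f2": 0.25, "f3": 0.20, "f4": 0.15}

# Hypothetical per-sample attributions, one dict per sample.
per_sample = [
    {"f1": -0.8, "f2": 0.1, "f3": 0.5, "f4": 0.0},
    {"f1": 0.6, "f2": -0.5, "f3": 0.3, "f4": 0.2},
]
shap_scores = mean_abs_attribution(per_sample)

# Each method yields its own top-k subset; a classifier would then be
# retrained on each subset and compared by AUPRC.
builtin_subset = top_k_features(builtin_scores, k=2)
shap_subset = top_k_features(shap_scores, k=2)

# Toy labels and predicted fraud probabilities for the metric.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Keeping the selection step independent of where the scores come from means the same code path serves both rankings, so any difference in downstream AUPRC can be attributed to the ranking itself rather than to the selection logic.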

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s40537-024-00905-w.pdf

References: 34 articles.

1. Hancock JT, Khoshgoftaar TM, Johnson JM. A comparative approach to threshold optimization for classifying imbalanced data. In: The International Conference on Collaboration and Internet Computing (CIC), Atlanta, GA, USA, 2022. pp. 135–142. IEEE.

2. Wang H, Liang Q, Hancock JT, Khoshgoftaar TM. Enhancing credit card fraud detection through a novel ensemble feature selection technique. In: 2023 IEEE International Conference on Information Reuse and Integration (IRI), Bellevue, WA, USA, 2023. pp. 121–126.

3. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30.

4. Waspada I, Bahtiar N, Wirawan PW, Awa BDA. Performance analysis of isolation forest algorithm in fraud detection of credit card transactions. Khazanah Informatika Jurnal. 2022.

5. Wang H, Hancock JT, Khoshgoftaar TM. Improving medicare fraud detection through big data size reduction techniques. In: 2023 IEEE International Conference on Service-Oriented System Engineering (SOSE), Athens, Greece; 2023. pp. 208–217.

Cited by 4 articles.

1. Utilising intraoperative respiratory dynamic features for developing and validating an explainable machine learning model for postoperative pulmonary complications. Comment on Br J Anaesth 2024; 132: 1315–26;British Journal of Anaesthesia;2024-09

2. Prediction of titanium burn-off and ultimate titanium content in electroslag process;Journal of Materials Research and Technology;2024-09

3. Explainable artificial intelligence (XAI) in finance: a systematic literature review;Artificial Intelligence Review;2024-07-26

4. Machine Learning to Predict Drug-Induced Liver Injury and Its Validation on Failed Drug Candidates in Development;Toxics;2024-05-24