A Simulation Study Comparing the Use of Supervised Machine Learning Variable Selection Methods in the Psychological Sciences

Published: 2023-09-28

Author:

Bain Catherine, Shi Dingjing, Boness Cassandra L., Loeffelman Jordan

Abstract

When specifying a predictive model for classification, variable selection (or subset selection) is one of the most important steps for researchers to consider. Reducing the number of variables in a prediction model matters for many reasons, including reducing the burden of data collection and increasing model efficiency and generalizability. The pool of variable selection methods is large, and researchers often struggle to identify which method to use given the specific features of their dataset. Yet little literature is available to guide this choice; existing work centers on comparing different implementations of a given method rather than comparing different methodologies under varying data features. Using a large-scale Monte Carlo simulation and an application to one empirical dataset, we evaluated the prediction error rates, area under the receiver operating characteristic curve, number of variables selected, computation times, and true positive rates of five variable selection methods in R under varying parameterizations (i.e., default vs. grid tuning): the genetic algorithm (ga), LASSO (glmnet), Elastic Net (glmnet), Support Vector Machines (svmfs), and random forest (Boruta). The performance measures did not converge on a single best method; researchers should therefore choose a method based on the performance measure they deem most important. The SVM approach performed worst, and researchers are advised to use other methods. LASSO and Elastic Net performed well in most conditions, but researchers may face non-convergence problems with these methods. Random forest performed well across simulation conditions. Based on our study, the genetic algorithm is the most widely applicable method, exhibiting the lowest error rates in hold-out samples compared with the other variable selection methods.
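
As a rough illustration of three of the performance measures named in the abstract (true positive rate of selection, hold-out prediction error rate, and area under the ROC curve), the sketch below uses common textbook definitions in plain Python. The function names, variable names, and toy data are hypothetical. The abstract does not give the study's exact definitions, and the study itself was run in R, so this should be read as an assumption-laden sketch rather than the authors' procedure.

```python
# Illustrative only: textbook definitions, not the study's own code.

def selection_true_positive_rate(selected, truly_relevant):
    """Share of the truly relevant variables that a method selected."""
    relevant = set(truly_relevant)
    if not relevant:
        raise ValueError("truly_relevant must not be empty")
    return len(relevant & set(selected)) / len(relevant)


def prediction_error_rate(y_true, y_pred):
    """Proportion of hold-out cases that were misclassified."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive case gets a
    higher score than a randomly chosen negative case (ties count as 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: a method selected x1, x2, x7; x1..x4 truly matter.
selected = ["x1", "x2", "x7"]
truly_relevant = ["x1", "x2", "x3", "x4"]

# Hypothetical hold-out labels, predicted classes, and predicted scores.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6]

print(selection_true_positive_rate(selected, truly_relevant))
print(prediction_error_rate(y_true, y_pred))
print(auc(y_true, scores))
```

Because the abstract reports that these measures did not agree on a single best method, computing them side by side for each candidate method is one way to make the trade-off explicit before choosing.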

Publisher

Center for Open Science

Cited by 2 articles.

1. A Tutorial on Supervised Machine Learning Variable Selection Methods for the Social and Health Sciences in R; 2024-06-05

2. Investigating Variable Selection Techniques Under Missing Data: A Simulation Study; Springer Proceedings in Mathematics & Statistics; 2024