Affiliation:
1. Département de sciences de la décision, HEC Montréal, Montréal, Québec, Canada
2. iA Financial Group, Montréal, Québec, Canada
3. Département de mathématiques, UQÀM, Montréal, Québec, Canada
Abstract
In supervised learning, feature selection methods identify the most relevant predictors to include in a model. For linear models, the inclusion or exclusion of each variable may be represented as a vector of bits playing the role of the genetic material that defines the model. Genetic algorithms reproduce the strategies of natural selection on a population of models to identify the best one. We derive the distribution of the importance scores for parallel genetic algorithms under the null hypothesis that none of the features has predictive power. These distributions hence provide an objective threshold for feature selection that does not require the visual inspection of a bubble plot. We also introduce the eradication strategy, akin to forward stepwise selection, where the genes of useful variables are sequentially forced into the models. The method is illustrated on real data, and simulation studies are run to describe its performance.
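The bit-vector encoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' parallel algorithm or eradication strategy; it is an assumed toy genetic algorithm in which each chromosome is a vector of inclusion bits for a linear model, fitness is the negative BIC of the corresponding OLS fit, and the usual selection, crossover, and mutation steps evolve the population. All names, parameter values, and the synthetic data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: only the first 3 of 10 candidate features are predictive.
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.8 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

def fitness(bits):
    """Negative BIC of the OLS fit on the selected columns (higher is better)."""
    cols = np.flatnonzero(bits)
    Z = np.column_stack([np.ones(n)] + ([X[:, cols]] if cols.size else []))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = np.sum((y - Z @ beta) ** 2)
    k = Z.shape[1]
    return -(n * np.log(rss / n) + k * np.log(n))

# Each chromosome is a vector of p inclusion bits, as in the abstract.
pop = rng.integers(0, 2, size=(40, p))
for generation in range(30):
    scores = np.array([fitness(c) for c in pop])
    parents = pop[np.argsort(scores)[::-1][:20]]     # truncation selection
    cuts = rng.integers(1, p, size=20)               # one-point crossover
    children = np.array(
        [np.concatenate([parents[i][:c], parents[(i + 1) % 20][c:]])
         for i, c in enumerate(cuts)]
    )
    flips = rng.random(children.shape) < 0.05        # bit-flip mutation
    children = np.where(flips, 1 - children, children)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(c) for c in pop])]
print(best)
```

Under this setup the evolved bit vector typically turns on the three truly predictive features and leaves most noise features off, which is the behavior the importance-score thresholding in the paper formalizes.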
Funder
Natural Sciences and Engineering Research Council of Canada
Subject
Statistics, Probability and Uncertainty; Statistics and Probability