Affiliation:
1. Digital Sciences & Technologies, Safran Tech, 78114 Magny-Les-Hameaux, France
2. Institut Polytechnique de Paris, Center for Applied Mathematics, UMR7641, École polytechnique, 91120 Palaiseau, France
Abstract
Variable importance measures are the main tools used to analyse the black-box mechanisms of random forests. Although the mean decrease accuracy is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the definition of mean decrease accuracy varies across the main random forest software packages. In this article, our objective is to rigorously analyse the behaviour of the main mean decrease accuracy implementations. To this end, we mathematically formalize the various implemented mean decrease accuracy algorithms, and then establish their limits as the sample size increases. This asymptotic analysis reveals that these mean decrease accuracy versions differ as importance measures, since they converge towards different quantities. More importantly, we break down these limits into three components: the first two terms are related to Sobol indices, which are well-defined measures of a covariate's contribution to the response variance and are widely used in the sensitivity analysis field, whereas the value of the third term increases with dependence among covariates. Thus, we theoretically demonstrate that the mean decrease accuracy does not target the right quantity to detect influential covariates in a dependent setting, a fact that has already been noticed experimentally. To address this issue, we define a new importance measure for random forests, the Sobol-mean decrease accuracy, which fixes the flaws of the original mean decrease accuracy and consistently estimates the accuracy decrease of the forest retrained without a given covariate, while remaining computationally efficient. The Sobol-mean decrease accuracy empirically outperforms its competitors on both simulated and real data for variable selection.
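The gap between a permutation-based accuracy decrease and the accuracy decrease obtained by removing a covariate can be illustrated with a small simulation. The sketch below is not the Sobol-mean decrease accuracy algorithm and does not use a random forest: as a simplifying assumption, the forest is replaced by the true regression function of a hypothetical model Y = X1 + X2 + noise, where X1 and X2 are standard Gaussian with correlation rho. All function names are illustrative.

```python
import random


def simulate(n, rho, seed=0):
    """Draw n rows (x1, x2, y) with corr(X1, X2) = rho and Y = X1 + X2 + noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rho * x1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        y = x1 + x2 + 0.1 * rng.gauss(0.0, 1.0)
        rows.append((x1, x2, y))
    return rows


def base_error(rows):
    """Mean squared error of the oracle predictor m(x) = x1 + x2."""
    return sum((y - (x1 + x2)) ** 2 for x1, x2, y in rows) / len(rows)


def permutation_decrease(rows, seed=1):
    """Error increase of the oracle predictor after shuffling the X1 column."""
    rng = random.Random(seed)
    shuffled = [x1 for x1, _, _ in rows]
    rng.shuffle(shuffled)
    permuted = sum(
        (y - (s + x2)) ** 2 for s, (_, x2, y) in zip(shuffled, rows)
    ) / len(rows)
    return permuted - base_error(rows)


def removal_decrease(rows, rho):
    """Error increase when X1 is removed and Y is predicted from X2 alone.

    Under the assumed Gaussian model, E[Y | X2] = (1 + rho) * X2.
    """
    reduced = sum((y - (1.0 + rho) * x2) ** 2 for _, x2, y in rows) / len(rows)
    return reduced - base_error(rows)


if __name__ == "__main__":
    rho = 0.9
    rows = simulate(20000, rho)
    # Population values under the assumed model: 2 and 1 - rho**2.
    print("permutation:", round(permutation_decrease(rows), 3))
    print("removal:    ", round(removal_decrease(rows, rho), 3))
```

In this toy model the permutation-based value stays near 2 whatever the correlation, whereas the removal-based value is 1 - rho**2 and shrinks as X2 carries more of the information in X1. This is only meant to convey why the two quantities can disagree under dependence.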
Publisher
Oxford University Press (OUP)
Subject
Applied Mathematics; Statistics, Probability and Uncertainty; General Agricultural and Biological Sciences; Agricultural and Biological Sciences (miscellaneous); General Mathematics; Statistics and Probability
Cited by
30 articles.