Mean decrease accuracy for random forests: inconsistency, and a practical solution via the Sobol-MDA

Published: 2022-02-25 Issue: 4 Volume: 109 Page: 881-900
ISSN: 0006-3444
Container-title: Biometrika
Language: en

Author:

Bénard Clément¹, Da Veiga Sébastien¹, Scornet Erwan²

Affiliation:

1. Digital Sciences & Technologies, Safran Tech, 78114 Magny-Les-Hameaux, France

2. Institut Polytechnique de Paris, Center for Applied Mathematics, UMR7641, École polytechnique, 91120 Palaiseau, France

Abstract

Variable importance measures are the main tools used to analyse the black-box mechanisms of random forests. Although the mean decrease accuracy is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the definition of mean decrease accuracy varies across the main random forest software. In this article, our objective is to rigorously analyse the behaviour of the main mean decrease accuracy implementations. Consequently, we mathematically formalize the various implemented mean decrease accuracy algorithms, and then establish their limits when the sample size increases. This asymptotic analysis reveals that these mean decrease accuracy versions differ as importance measures, since they converge towards different quantities. More importantly, we break down these limits into three components: the first two terms are related to Sobol indices, which are well-defined measures of a covariate contribution to the response variance, widely used in the sensitivity analysis field, as opposed to the third term, whose value increases with dependence within covariates. Thus, we theoretically demonstrate that the mean decrease accuracy does not target the right quantity to detect influential covariates in a dependent setting, a fact that has already been noticed experimentally. To address this issue, we define a new importance measure for random forests, the Sobol-mean decrease accuracy, which fixes the flaws of the original mean decrease accuracy, and consistently estimates the accuracy decrease of the forest retrained without a given covariate, but with an efficient computational cost. The Sobol-mean decrease accuracy empirically outperforms its competitors on both simulated and real data for variable selection.
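
The gap between a permutation-based importance and the accuracy decrease after retraining without a covariate can be illustrated with a minimal sketch. This is not the Sobol-MDA algorithm and involves no random forest. It assumes a hypothetical toy model Y = X1 + X2 + noise with standard Gaussian covariates of correlation rho, and it uses the true regression function as a stand-in for a fitted forest. All function names (`simulate`, `full_model`, `permutation_importance`, `retrain_importance`) are illustrative and do not come from the article.

```python
import random


def simulate(n, rho, seed):
    """Draw n rows (x1, x2, y) with corr(x1, x2) = rho and y = x1 + x2 + noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rho * x1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        y = x1 + x2 + rng.gauss(0.0, 0.1)
        rows.append((x1, x2, y))
    return rows


def full_model(x1, x2):
    # Assumption: the true regression function stands in for a fitted forest.
    return x1 + x2


def mse(predict, rows):
    """Mean squared prediction error of `predict` over the rows."""
    return sum((y - predict(x1, x2)) ** 2 for x1, x2, y in rows) / len(rows)


def permutation_importance(rows, seed=1):
    """Error increase of the fixed model after shuffling x1 across rows."""
    rng = random.Random(seed)
    shuffled = [x1 for x1, _, _ in rows]
    rng.shuffle(shuffled)
    permuted = [(s, x2, y) for s, (_, x2, y) in zip(shuffled, rows)]
    return mse(full_model, permuted) - mse(full_model, rows)


def retrain_importance(train, test):
    """Error increase after refitting without x1 (least squares on x2 only)."""
    # No intercept is fitted because all variables are centred by construction.
    slope = sum(x2 * y for _, x2, y in train) / sum(x2 * x2 for _, x2, _ in train)
    return mse(lambda x1, x2: slope * x2, test) - mse(full_model, test)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        train = simulate(20000, rho, seed=0)
        test = simulate(20000, rho, seed=1)
        print(rho, permutation_importance(test), retrain_importance(train, test))
```

Under these toy assumptions, the population value of the permutation-based quantity is 2 whatever the correlation, whereas the error increase after refitting without X1 is 1 - rho^2, which vanishes as X2 becomes a near copy of X1. The permutation measure therefore keeps rating X1 as highly important even when removing it costs almost nothing, which is the kind of behaviour under dependence that the abstract describes.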

Publisher

Oxford University Press (OUP)

Subject

Applied Mathematics, Statistics, Probability and Uncertainty, General Agricultural and Biological Sciences, Agricultural and Biological Sciences (miscellaneous), General Mathematics, Statistics and Probability

Link

https://academic.oup.com/biomet/advance-article-pdf/doi/10.1093/biomet/asac017/43557265/asac017.pdf

Reference: 51 articles.

1. Explaining individual predictions when features are dependent: more accurate approximations to Shapley values; Aas; Artif. Intel., 2021

2. Random forests for global sensitivity analysis: a selective review; Antoniadis; Reliab. Eng. Syst. Safety, 2020

3. Empirical characterization of random forest variable importance measures; Archer; Comp. Statist. Data Anal., 2008

4. Empirical comparison of tree ensemble variable importance measures; Auret; Chemom. Intell. Lab. Syst., 2011

5. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics; Boulesteix; Data Mining Know. Disc., 2012

Cited by 30 articles.

1. Variable importance analysis of wind turbine extreme responses with Shapley value explanation; Renewable Energy; 2024-10

2. Exploring Responsiveness to Highly Challenging Balance and Gait Training in Parkinson's Disease; Movement Disorders Clinical Practice; 2024-08-21

3. A comparative study of machine learning methods for identifying the 15 CIE standard skies; Journal of Building Physics; 2024-08-05

4. Synergistic Biocontrol and Growth Promotion in Strawberries by Co-Cultured Trichoderma harzianum TW21990 and Burkholderia vietnamiensis B418; Journal of Fungi; 2024-08-05

5. Shapley Curves: A Smoothing Perspective; Journal of Business & Economic Statistics; 2024-07-29