Relating the Partial Dependence Plot and Permutation Feature Importance to the Data Generating Process

Published:2023 Page:456-479
ISSN:1865-0929
Container-title:Communications in Computer and Information Science

Author:

Molnar Christoph, Freiesleben Timo, König Gunnar, Herbinger Julia, Reisinger Tim, Casalicchio Giuseppe, Wright Marvin N., Bischl Bernd

Abstract

Scientists and practitioners increasingly rely on machine learning to model data and draw conclusions. Compared to statistical modeling approaches, machine learning makes fewer explicit assumptions about data structures, such as linearity. Consequently, the parameters of machine learning models usually cannot be easily related to the data generating process. To learn about the modeled relationships, partial dependence (PD) plots and permutation feature importance (PFI) are often used as interpretation methods. However, PD and PFI lack a theory that relates them to the data generating process. We formalize PD and PFI as statistical estimators of ground truth estimands rooted in the data generating process. We show that PD and PFI estimates deviate from this ground truth not only due to statistical biases, but also due to learner variance and Monte Carlo approximation errors. To account for these uncertainties in PD and PFI estimation, we propose the learner-PD and the learner-PFI based on model refits and propose corrected variance and confidence interval estimators.
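As a rough illustration of the quantities the abstract names, the sketch below estimates PD and PFI for a toy linear model and averages them over several model refits. It is not the authors' implementation: the helper names (`fit_linear`, `partial_dependence`, `permutation_importance`, `learner_pd_pfi`), the random train/test splitting scheme, the squared-error loss, and the least-squares learner are all assumptions chosen for brevity. The variance it returns is the plain spread across refits, not the corrected variance estimator the abstract refers to.

```python
import numpy as np

def fit_linear(X, y):
    """Hypothetical learner: least-squares linear model with intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef

def partial_dependence(predict, X, feature, grid):
    """Monte Carlo PD estimate: mean prediction with `feature` fixed to each grid value."""
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v
        values.append(predict(X_mod).mean())
    return np.array(values)

def permutation_importance(predict, X, y, feature, rng, n_repeats=10):
    """PFI estimate: mean increase in squared error after permuting `feature`."""
    base_loss = np.mean((y - predict(X)) ** 2)
    increases = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, feature] = rng.permutation(X_perm[:, feature])
        increases.append(np.mean((y - predict(X_perm)) ** 2) - base_loss)
    return float(np.mean(increases))

def learner_pd_pfi(X, y, feature, grid, n_refits=20, test_frac=0.3, seed=0):
    """Average PD and PFI over refits on random train/test splits (assumed scheme).

    Returns (pd_mean, pd_var, pfi_mean, pfi_var), where the variances are the
    naive sample variances across refits.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    n_test = int(test_frac * n)
    pds, pfis = [], []
    for _ in range(n_refits):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        predict = fit_linear(X[train], y[train])
        # Evaluate on held-out rows so the estimates do not reuse training data.
        pds.append(partial_dependence(predict, X[test], feature, grid))
        pfis.append(permutation_importance(predict, X[test], y[test], feature, rng))
    pds, pfis = np.array(pds), np.array(pfis)
    return pds.mean(axis=0), pds.var(axis=0, ddof=1), pfis.mean(), pfis.var(ddof=1)

if __name__ == "__main__":
    # Toy data generating process (assumed): y depends on feature 0 only.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(500, 2))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=500)
    grid = np.linspace(-2.0, 2.0, 5)
    pd_mean, pd_var, pfi_mean, pfi_var = learner_pd_pfi(X, y, feature=0, grid=grid)
    print("PD curve:", pd_mean)
    print("PFI:", pfi_mean, "variance across refits:", pfi_var)
```

Because the toy data generating process is known, the averaged PD curve can be compared with the true effect of feature 0, which is the kind of comparison between estimate and ground truth that the abstract describes.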

Publisher

Springer Nature Switzerland

Link

https://link.springer.com/content/pdf/10.1007/978-3-031-44064-9_24

References: 49 articles.

1. Altmann, A., Toloşi, L., Sander, O., Lengauer, T.: Permutation importance: a corrected feature importance measure. Bioinformatics 26(10), 1340–1347 (2010)

2. Apley, D.W., Zhu, J.: Visualizing the effects of predictor variables in black box supervised learning models. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 82(4), 1059–1086 (2020)

3. Archer, K.J., Kimes, R.V.: Empirical characterization of random forest variable importance measures. Comput. Stat. Data Anal. 52(4), 2249–2260 (2008)

4. Bair, E., et al.: Multivariable modeling of phenotypic risk factors for first-onset TMD: the OPPERA prospective cohort study. J. Pain 14(12), T102–T115 (2013)

5. Bates, S., Candès, E., Janson, L., Wang, W.: Metropolized knockoff sampling. J. Am. Stat. Assoc. 116(535), 1413–1427 (2021)

Cited by 10 articles.

1. Feature Identification Using Interpretability Machine Learning Predicting Risk Factors for Disease Severity of In-Patients with COVID-19 in South Florida. Diagnostics, 2024-08-26

2. ANALYZING PMV VARIABILITY CHARACTERISTICS USING XAI. Journal of Environmental Engineering (Transactions of AIJ), 2024-08-01

3. Scientific Inference with Interpretable Machine Learning: Analyzing Models to Learn About Real-World Phenomena. Minds and Machines, 2024-07-15

4. Rapid detection of turtle cracks in corn seed based on reflected and transmitted images combined with deep learning method. Microchemical Journal, 2024-06

5. Regression Model for the Prediction of Total Motor Power Used by an Industrial Robot Manipulator during Operation. Machines, 2024-03-28