Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction

Published: 2020-07-25
Issue: 1
Volume: 20
ISSN: 1471-2288
Container-title: BMC Medical Research Methodology
Language: en
Short-container-title: BMC Med Res Methodol

Author:

Hong Shangzhi, Lynn Henry S.

Abstract

Background: Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data, especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it remains unclear how well they perform for non-normally distributed data or when there are non-linear relationships or interactions.

Methods: To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performance of the RF-based imputation methods missForest and CALIBERrfimpute was evaluated in comparison with predictive mean matching (PMM).

Results: Both missForest and CALIBERrfimpute have high predictive accuracy, but missForest can produce severely biased regression coefficient estimates and downward-biased confidence interval coverage, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than those of PMM for logistic regression relationships with interaction.

Conclusions: RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.
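As an illustration of what "outcome-dependent MAR" means for a covariate, the following is a minimal sketch rather than the authors' simulation design. The log-normal covariate, the linear outcome model, the logistic missingness model, and all numeric constants are assumptions chosen for illustration; the function name `simulate_outcome_dependent_mar` is hypothetical.

```python
import math
import random


def simulate_outcome_dependent_mar(n=2000, beta=1.0, seed=0):
    """Simulate (x, y) pairs, then delete x with a probability that rises with y.

    All distributions and constants here are illustrative assumptions,
    not values taken from the paper.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.lognormvariate(0.0, 1.0)    # skewed (non-normal) covariate
        y = beta * x + rng.gauss(0.0, 1.0)  # assumed linear outcome model
        # Missingness in x depends only on the fully observed outcome y,
        # which makes the mechanism MAR and outcome-dependent.
        p_missing = 1.0 / (1.0 + math.exp(-(y - 1.5)))
        x_obs = None if rng.random() < p_missing else x
        rows.append((x_obs, y))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


rows = simulate_outcome_dependent_mar()
missing_fraction = sum(x is None for x, _ in rows) / len(rows)
y_when_x_missing = mean(y for x, y in rows if x is None)
y_when_x_observed = mean(y for x, y in rows if x is not None)

print(f"fraction of x missing: {missing_fraction:.2f}")
print(f"mean y where x is missing:  {y_when_x_missing:.2f}")
print(f"mean y where x is observed: {y_when_x_observed:.2f}")
```

Under these assumptions, rows with a missing covariate have systematically higher outcomes than rows with an observed covariate, so the complete cases are not a random subsample. This is the kind of setting in which the abstract reports that imputation methods can yield biased regression coefficients.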

Publisher

Springer Science and Business Media LLC

Subject

Health Informatics,Epidemiology

Link

https://link.springer.com/content/pdf/10.1186/s12874-020-01080-1.pdf

References (19 articles; first 5 listed)

1. Van Buuren S. Flexible imputation of missing data. Chapman and Hall/CRC; 2018.

2. Stekhoven DJ, Bühlmann P. MissForest--non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28(1):112–8.

3. Shah AD, Bartlett JW, Carpenter J, Nicholas O, Hemingway H. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. Am J Epidemiol. 2014;179(6):764–74.

4. Ramosaj B, Pauly M. Predicting missing values: a comparative study on non-parametric approaches for imputation. Comput Stat. 2019;34(4):1741–64.

5. Tang F, Ishwaran H. Random forest missing data algorithms. Stat Anal Data Min. 2017;10(6):363–77.

Cited by 120 articles.

1. Hybrid deep learning based prediction for water quality of plain watershed;Environmental Research;2024-12

2. A novel and efficient risk minimisation-based missing value imputation algorithm;Knowledge-Based Systems;2024-11

3. The challenges of using machine learning models in psychiatric research and clinical practice;European Neuropsychopharmacology;2024-11

4. Using Multi-Source data to identify high NOx emitting Heavy-Duty diesel vehicles;Transportation Research Part D: Transport and Environment;2024-09

5. Multidimensional-Based Prediction of Pressure Ulcers Development and Severity in Hospitalized Frail Oldest Old: A Retrospective Study;Clinical Interventions in Aging;2024-09