Generative adversarial networks for imputing missing data for big data clinical research

Published:2021-04-20 Issue:1 Volume:21
ISSN:1471-2288
Container-title:BMC Medical Research Methodology
language:en
Short-container-title:BMC Med Res Methodol

Author:

Dong Weinan, Fong Daniel Yee Tak, Yoon Jin-sun, Wan Eric Yuk Fai, Bedford Laura Elizabeth, Tang Eric Ho Man, Lam Cindy Lo Kuen

Abstract

Background: Missing data is a pervasive problem in clinical research. Generative adversarial imputation nets (GAIN), a novel machine learning approach to data imputation, has the potential to impute missing data accurately and efficiently but has not yet been evaluated in large empirical clinical datasets.

Objectives: This study aimed to evaluate the accuracy and computational efficiency of GAIN in imputing missing values in large real-world clinical datasets with mixed-type variables, and to compare its performance with two commonly used methods, MICE and missForest.

Methods: Two real-world clinical datasets were used. The first came from a cohort study on the long-term outcomes of patients with diabetes (50,000 complete cases), and the second from a cohort study on the effectiveness of a risk assessment and management programme for patients with hypertension (10,000 complete cases). Missing data (missing at random) in the independent variables were simulated at missingness rates of 20% and 50%. Imputation accuracy was measured by the normalized root mean square error (NRMSE) between imputed and real values for continuous variables and by the proportion of falsely classified entries (PFC) for categorical variables. Computation time per imputation was recorded for each method. Differences in accuracy between imputation methods were compared using ANOVA or a non-parametric test.

Results: Both missForest and GAIN were more accurate than MICE. GAIN showed accuracy similar to that of missForest at a simulated missingness rate of 20%, but was more accurate at 50%. GAIN was the most accurate method for imputing skewed continuous and imbalanced categorical variables at both missingness rates. With a sample size of 50,000, GAIN was much faster (32 min on a PC) than missForest (1300 min).

Conclusion: GAIN imputed missing data in large real-world clinical datasets more accurately than MICE and missForest, and was more robust to a high missingness rate (50%). Its high computation speed is an added advantage in big clinical data research. GAIN holds potential as an accurate and efficient method for missing data imputation in future big data clinical research.

Trial registration: ClinicalTrials.gov ID: NCT03299010; Unique Protocol ID: HKUCTR-2232
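
The two accuracy measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how NRMSE is normalized; the version below assumes the convention often used with missForest (root mean square error divided by the standard deviation of the true values), which may differ from the paper's exact definition. The function names, the `mask` convention (True marks a value that was artificially removed), and the toy data are all hypothetical and are not taken from the study.

```python
import math
from statistics import pvariance


def select_masked(values, mask):
    """Keep only the entries that were artificially set to missing (mask is True)."""
    return [v for v, m in zip(values, mask) if m]


def nrmse(true_vals, imputed_vals):
    """RMSE divided by the standard deviation of the true values.

    Assumed normalization; the abstract does not specify one.
    """
    n = len(true_vals)
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    return math.sqrt(mse / pvariance(true_vals))


def pfc(true_labels, imputed_labels):
    """Proportion of imputed categorical entries that differ from the true label."""
    wrong = sum(t != i for t, i in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


# Toy data (hypothetical): the last entry was never masked, so it is excluded.
mask = [True, True, True, True, False]

true_cont = [6.0, 7.0, 8.0, 9.0, 7.5]
imputed_cont = [6.5, 7.0, 7.5, 9.0, 7.5]

true_cat = ["yes", "no", "no", "yes", "no"]
imputed_cat = ["yes", "no", "yes", "yes", "no"]

nrmse_value = nrmse(select_masked(true_cont, mask), select_masked(imputed_cont, mask))
pfc_value = pfc(select_masked(true_cat, mask), select_masked(imputed_cat, mask))

print(f"NRMSE = {nrmse_value:.4f}")
print(f"PFC   = {pfc_value:.2f}")
```

Both measures are errors, so lower values indicate more accurate imputation. Restricting the comparison to masked entries reflects the simulation design described in the abstract, where true values are known only because the missingness was introduced artificially into complete cases.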

Publisher

Springer Science and Business Media LLC

Subject

Health Informatics, Epidemiology

Link

https://link.springer.com/content/pdf/10.1186/s12874-021-01272-3.pdf

References: 28 articles.

1. Li P, Stuart EA, Allison DB. Multiple imputation: a flexible tool for handling missing data. JAMA. 2015;314(18):1966–7. https://doi.org/10.1001/jama.2015.15281.

2. Yoon J, Davtyan C, van der Schaar M. Discovery and clinical decision support for personalized healthcare. IEEE J Biomed Health Inform. 2017;21(4):1133–45. https://doi.org/10.1109/JBHI.2016.2574857.

3. Altman DG, Bland JM. Missing data. BMJ (Clinical research ed). 2007;334(7590):424.

4. Robinson KA, Dennison CR, Wayman DM, Pronovost PJ, Needham DM. Systematic review identifies number of strategies important for retaining study participants. J Clin Epidemiol. 2007;60(8):757.e1–e19.

5. Hayati Rezvan P, Lee KJ, Simpson JA. The rise of multiple imputation: a review of the reporting and implementation of the method in medical research. BMC Med Res Methodol. 2015;15(1):30. https://doi.org/10.1186/s12874-015-0022-1.

Cited by 34 articles.

1. Identify the most appropriate imputation method for handling missing values in clinical structured datasets: a systematic review;BMC Medical Research Methodology;2024-08-28

2. Development and validation of 10‐year risk prediction models of cardiovascular disease in Chinese type 2 diabetes mellitus patients in primary care using interpretable machine learning‐based methods;Diabetes, Obesity and Metabolism;2024-07-15

3. From Simulation to Prediction: Enhancing Digital Twins with Advanced Generative AI Technologies;2024 IEEE 18th International Conference on Control & Automation (ICCA);2024-06-18

4. A systematic data characteristic understanding framework towards physical-sensor big data challenges;Journal of Big Data;2024-06-12

5. Imputation of missing photometric data and photometric redshift estimation for CSST;Monthly Notices of the Royal Astronomical Society;2024-06-05