Abstract
Longitudinal datasets of human ageing studies usually have a high volume of missing data, and one way to handle missing values in a dataset is to replace them with estimations. However, there are many methods to estimate missing values, and no single method is the best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected, based on their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicabilities and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated with datasets prepared using each imputation method and a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on our results from both sets of experiments, we concluded that the proposed data-driven missing value imputation approach generally resulted in models with more accurate estimations for missing data and better performing classifiers, in longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data had very accurate estimations. This reinforces the idea that using the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that can be achieved through the proposed data-driven approach.
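The feature-wise selection described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the five imputation methods, so the three candidates below (mean, median, mode) are stand-ins chosen for illustration, and the function names, the hold-out fraction, and the use of mean absolute error are all assumptions. For each feature, some known values are hidden on purpose, each candidate estimates them, and the candidate with the lowest error is used to fill that feature's real gaps.

```python
import random
import statistics

# Hypothetical candidate imputers (assumption: the paper's five methods are
# not listed in the abstract). Each maps the known values of one feature to
# a single fill-in value.
CANDIDATES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}


def estimation_error(known, imputer, n_trials=20, hide_frac=0.2, seed=0):
    """Mean absolute error of `imputer` on known values hidden on purpose."""
    if len(known) < 2:
        raise ValueError("need at least two known values to estimate an error")
    rng = random.Random(seed)  # same seed, so every candidate sees the same splits
    n_hide = max(1, int(len(known) * hide_frac))
    errors = []
    for _ in range(n_trials):
        idx = list(range(len(known)))
        rng.shuffle(idx)
        hidden = [known[i] for i in idx[:n_hide]]
        visible = [known[i] for i in idx[n_hide:]]
        estimate = imputer(visible)
        errors.extend(abs(estimate - v) for v in hidden)
    return statistics.mean(errors)


def impute_feature(values):
    """Rank the candidates on one feature and fill its missing entries (None)."""
    known = [v for v in values if v is not None]
    ranked = sorted(
        CANDIDATES,
        key=lambda name: estimation_error(known, CANDIDATES[name]),
    )
    best = ranked[0]  # lowest estimated error for this feature
    fill = CANDIDATES[best](known)
    return best, [fill if v is None else v for v in values]


def impute_dataset(columns):
    """Apply the per-feature selection to a dict of {feature name: values}."""
    chosen, completed = {}, {}
    for name, values in columns.items():
        chosen[name], completed[name] = impute_feature(values)
    return chosen, completed
```

Calling `impute_dataset({"feature_a": [...], "feature_b": [...]})` returns the method chosen for each feature alongside the completed columns, so different features may end up with different imputation methods. A longitudinal variant would add candidates that use a subject's values from other waves, which this sketch does not attempt.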
Publisher
Springer Science and Business Media LLC
Subject
Artificial Intelligence, Linguistics and Language, Language and Linguistics
Cited by
13 articles.