A Benchmark for Data Imputation Methods

Published:2021-07-08 Volume:4
ISSN:2624-909X
Container-title:Frontiers in Big Data
Short-container-title:Front. Big Data

Author:

Jäger Sebastian,Allhorn Arndt,Bießmann Felix

Abstract

With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. The importance of data quality has been recognized beyond the field of data engineering and database management systems (DBMSs). For machine learning (ML) applications as well, high data quality standards are crucial to ensure robust predictive performance and responsible usage of automated decision-making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and, when not detected, can have a devastating impact on downstream ML applications. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions remain scarce. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when missing data affect either only the test data or both the training and test data. Each imputation method is evaluated on both imputation quality and the impact of imputation on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that our results help researchers and engineers select data preprocessing methods for automated data quality improvement.

Publisher

Frontiers Media SA

Subject

Artificial Intelligence,Information Systems,Computer Science (miscellaneous)

References: 53 articles.

1. Detecting Data Errors;Abedjan;Proc. VLDB Endow.,2016

2. Data Profiling;Abedjan;Synth. Lectures Data Manag.,2018

3. An Analysis of Four Missing Data Treatment Methods for Supervised Learning;Batista;Appl. Artif. Intelligence,2003

4. Tfx;Baylor,2017

5. On the Dangers of Stochastic Parrots;Bender,2021

Cited by 91 articles.

1. Machine Learning for Predicting Prehurricane Structural Damage;Natural Hazards Review;2024-11

2. On the consistency of supervised learning with missing values;Statistical Papers;2024-09-12

3. Methodological approaches in developing and implementing digital health interventions amongst underserved women;Public Health Nursing;2024-09-02

4. Accurate diagnosis of acute appendicitis in the emergency department: an artificial intelligence-based approach;Internal and Emergency Medicine;2024-08-21

5. Fatigued individuals show increased conformity in virtual meetings;Scientific Reports;2024-08-13