Missing Data Imputation with Uncertainty-Driven Network

Published:2024-05-29 Issue:3 Volume:2 Page:1-25
ISSN:2836-6573
Container-title:Proceedings of the ACM on Management of Data
language:en
Short-container-title:Proc. ACM Manag. Data

Author:

Wang Jianwei¹, Zhang Ying², Wang Kai³, Lin Xuemin³, Zhang Wenjie¹

Affiliation:

1. University of New South Wales, Sydney, Australia

2. Zhejiang Gongshang University, Hangzhou, China

3. Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai, China

Abstract

We study the problem of missing data imputation, a fundamental task in the area of data quality that aims to impute missing data so that datasets are complete. Though recent distribution-modeling-based techniques (e.g., distribution generation and distribution matching) achieve state-of-the-art imputation accuracy, we notice that (1) they deploy a sophisticated deep learning model that tends to overfit for missing data imputation; (2) they rely directly on a global data distribution while overlooking local information. Driven by the inherent variability in both missing data and missing mechanisms, in this paper we explore the uncertain nature of this task and aim to address the limitations of existing works by proposing an uNcertainty-driven netwOrk for Missing data Imputation, termed NOMI. NOMI has three key components: the retrieval module, the neural network Gaussian process imputator (NNGPI), and the uncertainty-based calibration module. NOMI runs these components sequentially and iteratively to achieve better imputation performance. Specifically, in the retrieval module, NOMI retrieves local neighbors of the incomplete data samples based on a pre-defined similarity metric. Subsequently, we design NNGPI, which merges the advantages of the Gaussian Process with the universal approximation capacity of neural networks. NNGPI models uncertainty by learning the posterior distribution over the data to impute missing values while alleviating the overfitting issue. Moreover, we propose an uncertainty-based calibration module that uses the imputator's uncertainty about its predictions to help the retrieval module obtain more reliable local information, thereby further enhancing imputation performance. We also demonstrate that NOMI can be reformulated as an instance of the well-known Expectation Maximization (EM) algorithm, highlighting the strong theoretical foundation of the proposed method. Extensive experiments are conducted over 12 real-world datasets. The results demonstrate the excellent performance of NOMI in terms of both accuracy and efficiency.
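The retrieve, impute, calibrate loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it swaps NNGPI for a plain Gaussian process with an RBF kernel, uses Euclidean distance as the similarity metric, and approximates calibration by down-weighting rows whose imputed values had high posterior variance. The function names (`rbf_kernel`, `gp_predict`, `impute`) and all parameters are hypothetical, and the features are assumed to be on comparable scales.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared Euclidean distances between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X_train, y_train, X_test, noise=1e-2):
    """Posterior mean and variance of a GP with an RBF kernel (stand-in for NNGPI)."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train)
    y_mean = y_train.mean()                      # center targets on the neighbor mean
    alpha = np.linalg.solve(K, y_train - y_mean)
    mean = K_s @ alpha + y_mean
    v = np.linalg.solve(K, K_s.T)
    var = 1.0 - np.sum(K_s * v.T, axis=1)        # k(x, x) = 1 for the RBF kernel
    return mean, np.maximum(var, 0.0)

def impute(X, k=5, n_iters=3):
    """Iterative local imputation: retrieve neighbors, predict, recalibrate."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)  # mean initialization
    weights = np.ones(len(X))                    # per-row reliability (assumed scheme)
    for _ in range(n_iters):
        new_filled = filled.copy()
        row_unc = np.zeros(len(X))
        for j in range(X.shape[1]):
            rows = np.where(missing[:, j])[0]
            donors = np.where(~missing[:, j])[0]  # rows where column j is observed
            if len(rows) == 0 or len(donors) == 0:
                continue
            other = np.delete(np.arange(X.shape[1]), j)
            for i in rows:
                # Retrieval: nearest donors on the other columns; distances to
                # donors with uncertain imputations are inflated.
                d = np.linalg.norm(filled[donors][:, other] - filled[i, other], axis=1)
                nn = donors[np.argsort(d / weights[donors])[:k]]
                # Imputation: GP posterior fitted on the retrieved neighbors only.
                mean, var = gp_predict(filled[nn][:, other], X[nn, j],
                                       filled[i:i + 1, other])
                new_filled[i, j] = mean[0]
                row_unc[i] = max(row_unc[i], var[0])
        filled = new_filled
        # Calibration: rows imputed with high variance are trusted less next round.
        weights = 1.0 / (1.0 + row_unc)
    return filled
```

Calling `impute(X)` on a NumPy array with `np.nan` marking missing cells returns an array of the same shape in which observed cells are unchanged and missing cells are filled. Fitting the GP on only the `k` retrieved neighbors keeps each solve small, which is one way to read the abstract's emphasis on local information.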

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3654920
