Nearest neighbor classifiers over incomplete information

Published:2020-11 Issue:3 Volume:14 Page:255-267
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Karlaš Bojan¹, Li Peng², Wu Renzhi², Gürel Nezihe Merve¹, Chu Xu², Wu Wentao³, Zhang Ce¹

Affiliation:

1. ETH Zurich

2. Georgia Institute of Technology

3. Microsoft Research

Abstract

Machine learning (ML) applications have been thriving recently, largely owing to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP): a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) a checking query that determines whether a data example can be CP'ed; and (Q2) a counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumptions about the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed: we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach, built on CP, can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% of the gap on average by cleaning 36% of the dirty data on average, while the best automatic cleaning approach, BoostClean, closes only 14% of the gap on average.
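
To make the two CP queries concrete, the following is a minimal brute-force sketch for a 1-nearest-neighbor classifier. It is not the paper's algorithm: it enumerates every possible world explicitly, so its cost is exponential in the number of incomplete examples, whereas the abstract states that linear or polynomial time solutions exist for NN classifiers. The function names (`nn_predict`, `count_predictions`, `is_certain`), the candidate-repair representation, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def nn_predict(train_points, labels, test_point):
    """Return the label of the training point closest to test_point (1-NN).

    Uses squared Euclidean distance; ties go to the lowest index.
    """
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, test_point))

    best = min(range(len(train_points)), key=lambda i: sq_dist(train_points[i]))
    return labels[best]


def count_predictions(candidates, labels, test_point):
    """Counting query (Q2), by brute force.

    candidates[i] lists the possible values of training example i;
    a complete example has exactly one candidate. Each combination of
    candidates is one possible world. Returns, per label, the number of
    possible worlds whose 1-NN classifier predicts that label.
    """
    counts = Counter()
    for world in product(*candidates):
        counts[nn_predict(world, labels, test_point)] += 1
    return counts


def is_certain(candidates, labels, test_point):
    """Checking query (Q1): True if every possible world predicts the same label."""
    return len(count_predictions(candidates, labels, test_point)) == 1


# Hypothetical toy data: one complete example and two incomplete ones.
candidates = [
    [(0.0, 0.0)],              # complete example
    [(5.0, 5.0), (5.0, 6.0)],  # missing value with two candidate repairs
    [(1.0, 0.0), (1.0, 9.0)],  # missing value with two candidate repairs
]
labels = ["a", "b", "a"]

print(is_certain(candidates, labels, (0.1, 0.1)))  # same label in all 4 worlds
print(dict(count_predictions(candidates, labels, (2.0, 7.0))))  # worlds disagree
```

In this toy setup, a test point near the complete example is certainly predicted because no repair of the incomplete examples can change its nearest neighbor, while a test point between the candidate repairs is not. A cleaning approach in the spirit of the abstract would presumably prioritize the incomplete examples that prevent test points from being certainly predicted.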

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3430915.3430917

Cited by 12 articles.

1. GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data;Proceedings of the ACM on Management of Data;2023-06-13

2. LinCQA: Faster Consistent Query Answering with Linear Time Guarantees;Proceedings of the ACM on Management of Data;2023-05-26

3. RAB: Provable Robustness Against Backdoor Attacks;2023 IEEE Symposium on Security and Privacy (SP);2023-05

4. Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making;2023 IEEE 39th International Conference on Data Engineering (ICDE);2023-04

5. Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise;2023 IEEE 39th International Conference on Data Engineering (ICDE);2023-04