A data quality metric (DQM)

Published: 2017-06; Issue: 10; Volume: 10; Pages: 1094-1105
ISSN: 2150-8097
Container-title: Proceedings of the VLDB Endowment
Language: en
Short-container-title: Proc. VLDB Endow.

Author:

Chung Yeounoh¹, Krishnan Sanjay², Kraska Tim¹

Affiliation:

1. Brown University

2. UC Berkeley

Abstract

Data cleaning, whether manual or algorithmic, is rarely perfect, leaving a dataset with an unknown number of false positives and false negatives after cleaning. In many scenarios, quantifying the number of remaining errors is challenging because our data integrity rules themselves may be incomplete, or the available gold-standard datasets may be too small to extrapolate from. As the use of inherently fallible crowds becomes more prevalent in data cleaning problems, it is important to have estimators to quantify the extent of such errors. We propose novel species estimators to estimate the number of distinct remaining errors in a dataset after it has been cleaned by a set of crowd workers; in effect, they quantify the utility of hiring additional workers to clean the dataset. This problem requires new estimators that are robust to false positives and false negatives, and we empirically show on three real-world datasets that existing species estimators are unstable for this problem, while our proposed techniques quickly converge.
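
To make the idea of a species estimator concrete, the sketch below applies the classical bias-corrected Chao1 estimator to crowd-cleaning output. It is an illustration of the existing family of estimators the abstract refers to, not the estimators proposed in the paper. The function name `chao1_remaining_errors` and the input layout (one set of flagged item ids per worker) are assumptions made for this example, and the estimator treats every flag as a true error, which is exactly the assumption that fallible workers violate.

```python
from collections import Counter


def chao1_remaining_errors(worker_flags):
    """Estimate how many distinct errors no worker has flagged yet.

    worker_flags: iterable of sets, one per worker, holding the ids of
    the items that worker marked as erroneous (hypothetical layout).
    """
    # Count how many workers flagged each item.
    votes = Counter(item for flags in worker_flags for item in set(flags))
    observed = len(votes)
    f1 = sum(1 for v in votes.values() if v == 1)  # flagged by exactly one worker
    f2 = sum(1 for v in votes.values() if v == 2)  # flagged by exactly two workers
    # Bias-corrected Chao1: S_obs + f1 * (f1 - 1) / (2 * (f2 + 1)).
    # This form stays finite when f2 is zero.
    estimated_total = observed + f1 * (f1 - 1) / (2 * (f2 + 1))
    return estimated_total - observed


# Three workers; items 1, 4 and 5 were each flagged by only one worker.
flags = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
print(chao1_remaining_errors(flags))
```

Many items flagged by only one worker push the estimate up, suggesting that additional workers would still find new errors. A single false positive also counts as such a singleton, which gives one intuition for why estimators of this kind can be unstable with noisy workers.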

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3115404.3115414

Cited by 13 articles.

1. A Data-Driven Analysis of Behaviors in Data Curation Processes;ACM Transactions on Information Systems;2023-02-07

2. Information Resilience: the nexus of responsible and agile approaches to information use;The VLDB Journal;2022-01-16

3. Multi-Factor Influencing Truth Inference in Crowdsourcing;Journal of Information Science and Engineering;2021

4. Contextual Data Cleaning with Ontology FDs;Proceedings of the 2021 International Conference on Management of Data;2021-06-09

5. Advanced battery management strategies for a sustainable energy future: Multilayer design concepts and research trends;Renewable and Sustainable Energy Reviews;2021-03