Affiliation:
1. UC Berkeley
2. Simon Fraser University
3. Columbia University
Abstract
Analysts often clean dirty data iteratively: cleaning some data, executing the analysis, and then cleaning more data based on the results. We explore this iterative cleaning process in the context of statistical model training, an increasingly popular form of data analytics. We propose ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees. ActiveClean supports an important class of models called convex loss models (e.g., linear regression and SVMs), and it prioritizes cleaning those records most likely to affect the results. We evaluate ActiveClean on five real-world datasets (UCI Adult, UCI EEG, MNIST, IMDB, and Dollars For Docs) with both real and synthetic errors. The results show that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.
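To make the prioritization idea concrete, the sketch below illustrates one simple way to rank dirty records for cleaning by the magnitude of their per-record gradient under a convex loss (here, the hinge loss of an SVM). This is only a minimal illustration of gradient-magnitude prioritization under assumed function names (hinge_gradient, prioritize_dirty_records); it is not the paper's actual estimator, which is described in the full text.

```python
import numpy as np

def hinge_gradient(w, x, y):
    # Per-record (sub)gradient of the hinge loss, a convex loss model.
    margin = y * np.dot(w, x)
    return -y * x if margin < 1 else np.zeros_like(x)

def prioritize_dirty_records(w, X_dirty, y_dirty, budget):
    # Hypothetical helper: rank still-dirty records by estimated gradient
    # magnitude under the current model w, and sample a batch of size
    # `budget` to send to the cleaner, favoring records most likely to
    # change the model once cleaned.
    norms = np.array([np.linalg.norm(hinge_gradient(w, x, y))
                      for x, y in zip(X_dirty, y_dirty)])
    total = norms.sum()
    probs = norms / total if total > 0 else np.full(len(norms), 1.0 / len(norms))
    return np.random.choice(len(X_dirty), size=budget, replace=False, p=probs)
```

In this sketch, cleaned batches would then be fed back into a stochastic-gradient-style update of w, so that cleaning effort concentrates on records with the largest expected impact on the model.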
Cited by
96 articles.