Affiliation:
1. University of Edinburgh and SKLSDE Lab, Beihang University
2. SKLSDE Lab, Beihang University
3. QCRI
4. University of Edinburgh
Abstract
Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, and repairing is to make a database consistent by fixing errors in the data by using integrity constraints. These are typically treated as separate processes in current data cleaning systems, based on heuristic solutions. This article studies a new problem in connection with data cleaning, namely the interaction between record matching and data repairing. We show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, we provide a uniform framework that seamlessly unifies repairing and matching operations to clean a database based on integrity constraints, matching rules, and master data. We give a full treatment of fundamental problems associated with data cleaning via matching and repairing, including the static analyses of constraints and rules taken together, and the complexity, termination, and determinism analyses of data cleaning. We show that these problems are hard, ranging from NP-complete or coNP-complete, to PSPACE-complete. Nevertheless, we propose efficient algorithms to clean data via both matching and repairing. The algorithms find deterministic fixes and reliable fixes based on confidence and entropy analyses, respectively, which are more accurate than fixes generated by heuristics. Heuristic fixes are produced only when deterministic or reliable fixes are unavailable. We experimentally verify that our techniques can significantly improve the accuracy of record matching and data repairing that are taken as separate processes, using real-life and synthetic data.
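The interaction described in the abstract, where a repair enables a match and the match in turn enables a further repair, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the attribute names, the sample tuples, the rule "zip determines city", the matching rule "same phone and same city", and the function names are hypothetical and are not taken from the article, and the loop is not the article's algorithm.

```python
# Toy illustration only: data, rules, and function names are hypothetical
# and do not reproduce the algorithms of the article.

# Master data: assumed to be clean and trusted.
MASTER = [
    {"name": "Mark Smith", "phone": "3256778", "city": "Edi", "zip": "EH8 9AB"},
]


def repair_city_from_zip(t, zip_to_city):
    """Constraint-style repair: the zip code determines the city."""
    fixed = dict(t)
    city = zip_to_city.get(t["zip"])
    if city is not None:
        fixed["city"] = city
    return fixed


def matches(t, m):
    """Matching rule: same phone and same city means same entity."""
    return t["phone"] == m["phone"] and t["city"] == m["city"]


def enrich_from_master(t, m):
    """Once a tuple is matched, copy the master name into it."""
    fixed = dict(t)
    fixed["name"] = m["name"]
    return fixed


def clean(t, master):
    """Alternate repairing and matching until nothing changes."""
    zip_to_city = {m["zip"]: m["city"] for m in master}
    changed = True
    while changed:
        changed = False
        repaired = repair_city_from_zip(t, zip_to_city)
        if repaired != t:
            t, changed = repaired, True
        for m in master:
            if matches(t, m):
                enriched = enrich_from_master(t, m)
                if enriched != t:
                    t, changed = enriched, True
    return t


dirty = {"name": "M. Smith", "phone": "3256778", "city": "Ldn", "zip": "EH8 9AB"}
cleaned = clean(dirty, MASTER)
```

In this sketch the dirty tuple does not match the master tuple at first, because its city is wrong. Repairing the city makes the matching rule applicable, and the match then allows the name to be corrected from the master data, which neither step could achieve on its own.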
Funder
SRF
Shenzhen Peacock Program of China
National Natural Science Foundation of China
Engineering and Physical Sciences Research Council
ROCS
Ministry of Science and Technology of the People's Republic of China
SEM
Guangdong Innovative Research Team Program
Publisher
Association for Computing Machinery (ACM)
Subject
Information Systems and Management, Information Systems
Cited by
30 articles.