Affiliation:
1. University of Oran1, Algeria
2. Université Paris Descartes, Sorbonne Paris Cité, France
Abstract
One of the main challenges in data matching and data cleaning, in highly integrated systems, is duplicate detection. While the literature abounds with approaches for detecting duplicates that correspond to the same real-world entity, most of these approaches tend to eliminate duplicates (wrong information) from the sources, leading to what is called data repair. In this article, we propose a framework that automatically detects duplicates at query time and effectively identifies the consistent version of the data, while keeping inconsistent data in the sources. Our framework uses matching dependencies (MDs) to detect duplicates through the concept of data reconciliation rules (DRRs), and conditional functional dependencies (CFDs) to assess the quality of different attribute values. We also build a duplicate reconciliation index (DRI), based on clusters of duplicates detected by a set of DRRs, to speed up the online data reconciliation process. Our experiments on a real-world data collection show the efficiency and effectiveness of our framework.
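To make the abstract's pipeline concrete, the following is a minimal, hypothetical Python sketch (not the authors' implementation): a similarity-based rule standing in for an MD is applied as a DRR to cluster duplicate records, and the clusters are stored in a DRI-style map that can be probed at query time. All record fields, thresholds, and function names are illustrative assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List

# Toy source: records describing the same real-world entity with inconsistent values.
records = [
    {"id": 1, "name": "Jon Smith",    "city": "Oran",  "phone": "555-0101"},
    {"id": 2, "name": "John Smith",   "city": "Oran",  "phone": "555-0101"},
    {"id": 3, "name": "Ada Lovelace", "city": "Paris", "phone": "555-0202"},
]

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy string comparison standing in for the MD's similarity operator."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def drr_match(r1: dict, r2: dict) -> bool:
    """Hypothetical DRR derived from an MD: if names are similar and phones
    are equal, the two records refer to the same real-world entity."""
    return similar(r1["name"], r2["name"]) and r1["phone"] == r2["phone"]

def build_dri(rows: List[dict]) -> Dict[int, List[dict]]:
    """Build a duplicate reconciliation index: cluster id -> duplicate records.
    Clusters are computed ahead of time so query-time reconciliation is an
    index lookup rather than pairwise matching over the whole source."""
    clusters: Dict[int, List[dict]] = {}
    for row in rows:
        for members in clusters.values():
            if any(drr_match(row, m) for m in members):
                members.append(row)
                break
        else:
            clusters[len(clusters)] = [row]
    return clusters

dri = build_dri(records)
for cid, members in dri.items():
    print(cid, [m["id"] for m in members])
# Records 1 and 2 fall into the same cluster; a CFD-based quality check could
# then select the consistent "name" value without deleting either source record.
```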
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications
Cited by
2 articles.