MDedup

Published:2020-01 Issue:5 Volume:13 Page:712-725
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Koumarelas Ioannis¹, Papenbrock Thorsten¹, Naumann Felix¹

Affiliation:

1. University of Potsdam, Germany

Abstract

Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of the same real-world entities in (relational) datasets. Existing duplicate detection approaches are effective, but they are also hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific if not dataset-specific, which is a problem if a new dataset needs to be cleaned. For this reason, we propose a novel, rule-based and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a model that selects MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting step. Our experiments show that this approach reaches up to 94% F-measure and 100% precision on our evaluation datasets, which are good numbers considering that the system does not require domain or target data-specific configuration.

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3377369.3377379

Cited by 19 articles.

1. Low-resource entity resolution with domain generalization and active learning;Neurocomputing;2024-09

2. An efficient learning based approach for automatic record deduplication with benchmark datasets;Scientific Reports;2024-07-15

3. Extending Desbordante with Probabilistic Functional Dependency Discovery Support;2024 35th Conference of Open Innovations Association (FRUCT);2024-04-24

4. Efficient Differential Dependency Discovery;Proceedings of the VLDB Endowment;2024-03

5. Class Ratio and Its Implications for Reproducibility and Performance in Record Linkage;Lecture Notes in Computer Science;2024