Efficient Discovery of Matching Dependencies

Published: 2020-09-25 Issue: 3 Volume: 45 Page: 1-33
ISSN: 0362-5915
Container-title: ACM Transactions on Database Systems
Language: en
Short-container-title: ACM Trans. Database Syst.

Author:

Schirmer Philipp¹, Papenbrock Thorsten¹, Koumarelas Ioannis¹, Naumann Felix¹

Affiliation:

1. Hasso Plattner Institute, University of Potsdam, Germany

Abstract

Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They generalize functional dependencies (FDs) by matching similar rather than identical elements. Because their discovery is very difficult, existing profiling algorithms either find only small subsets of all MDs or are limited to small datasets. We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, however, the algorithm can enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on the results that downstream use cases actually require. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm can still efficiently process large datasets, e.g., datasets larger than 3 GB.
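
To make the notion of a matching dependency concrete, the following is a minimal brute-force sketch in Python. It is not HyMD and does not reflect the paper's implementation: it only illustrates what it means for an MD to hold (every record pair that is similar enough on the left-hand side must also be similar enough on the right-hand side) and how a right-hand-side threshold could be read off the data rather than fixed in advance. The function names (`similarity`, `md_holds`, `tightest_rhs_threshold`), the sample records, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Assumed similarity measure in [0, 1]; any normalized measure could be used."""
    return SequenceMatcher(None, a, b).ratio()


def lhs_matches(r1: dict, r2: dict, lhs: dict) -> bool:
    """True if the pair reaches every left-hand-side threshold (attr -> threshold)."""
    return all(similarity(r1[attr], r2[attr]) >= t for attr, t in lhs.items())


def md_holds(records: list, lhs: dict, rhs_attr: str, rhs_threshold: float) -> bool:
    """Check an MD by comparing all record pairs (quadratic, illustration only)."""
    for r1, r2 in combinations(records, 2):
        if lhs_matches(r1, r2, lhs) and similarity(r1[rhs_attr], r2[rhs_attr]) < rhs_threshold:
            return False  # similar on the LHS but not on the RHS: a violation
    return True


def tightest_rhs_threshold(records: list, lhs: dict, rhs_attr: str) -> float:
    """Highest RHS threshold for which the MD still holds, derived from the data.

    This is the minimum RHS similarity over all pairs that match on the LHS;
    if no pair matches, the MD holds vacuously and 1.0 is returned.
    """
    sims = [
        similarity(r1[rhs_attr], r2[rhs_attr])
        for r1, r2 in combinations(records, 2)
        if lhs_matches(r1, r2, lhs)
    ]
    return min(sims, default=1.0)


# Hypothetical sample data.
records = [
    {"name": "Jon Smith", "zip": "10115"},
    {"name": "John Smith", "zip": "10115"},
    {"name": "Mary Jones", "zip": "14482"},
]

# "Names at least 0.8 similar imply identical zip codes" holds on this data.
print(md_holds(records, {"name": 0.8}, "zip", 1.0))

# Adding a similar name with a different zip code violates the MD.
dirty = records + [{"name": "John Smith", "zip": "80331"}]
print(md_holds(dirty, {"name": 0.8}, "zip", 1.0))
```

Comparing every pair of records is quadratic in the number of records and enumerating all left-hand-side threshold combinations is far more expensive still, which is why the abstract stresses efficient discovery; the sketch only fixes the semantics that any discovery algorithm has to respect.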

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3392778

References: 38 articles.

1. ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

2. Adaptive name matching in information integration

3. Relaxed Functional Dependencies—A Survey of Approaches

Cited by 15 articles.

1. Towards declarative comparabilities: Application to functional dependencies;Journal of Computer and System Sciences;2024-12

2. Efficient Set-Based Order Dependency Discovery with a Level-Wise Hybrid Strategy;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13

3. Rapidash: Efficient Detection of Constraint Violations;Proceedings of the VLDB Endowment;2024-04

4. Efficient Differential Dependency Discovery;Proceedings of the VLDB Endowment;2024-03

5. Splitting Tuples of Mismatched Entities;Proceedings of the ACM on Management of Data;2023-12-08