Abstract
Determining whether two sets are related (that is, whether they have similar values or one set contains the other) is an important problem with many applications in data cleaning, data integration, and information retrieval. For example, set relatedness can be a useful tool to discover whether columns from two different databases are joinable; if enough of the values in the columns match, it may make sense to join them. A common metric measures the relatedness of two sets by treating their elements as vertices of a bipartite graph and computing the score of the maximum matching between the elements. Unlike metrics that require exact matches between elements, this metric uses a similarity function to compare elements across the two sets, making it robust to small dissimilarities in elements and more useful for real-world, dirty data. Unfortunately, the metric is computationally expensive, taking $O(n^3)$ time, where $n$ is the number of elements in the sets, for each set-to-set comparison. Thus, for applications that search for all pairs of related sets in a brute-force manner, the runtime becomes unacceptably large.
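To make the maximum matching metric concrete, the following is a minimal sketch, not the paper's implementation. It assumes whitespace-token Jaccard as the element similarity function (the abstract does not fix one), assumes the similarity is symmetric, and uses a textbook $O(n^3)$ Hungarian-style assignment routine; the names `jaccard` and `max_matching_score` are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Assumed element similarity: Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def max_matching_score(r, s, sim=jaccard) -> float:
    """Score of the maximum-weight bipartite matching between sets r and s.

    Textbook potential-based assignment algorithm, cubic in the set size.
    Assumes sim is symmetric, since the smaller set is always used as rows.
    """
    rows, cols = (list(r), list(s)) if len(r) <= len(s) else (list(s), list(r))
    n, m = len(rows), len(cols)
    if n == 0:
        return 0.0
    # Maximizing similarity is the same as minimizing negated similarity.
    cost = [[-sim(a, b) for b in cols] for a in rows]
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row matched to column j (0 = unmatched)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path that was found.
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sum(-cost[p[j] - 1][j - 1] for j in range(1, m + 1) if p[j] != 0)
```

For example, `max_matching_score(["a b c", "a b"], ["a b", "a"])` pairs `"a b"` with `"a b"` and `"a b c"` with `"a"`, which scores higher than the greedy choice of matching `"a b c"` to `"a b"` first. How the raw matching score is normalized into a similarity or containment value is left out here, since the abstract does not specify it.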
To address this challenge, we developed SilkMoth, a system capable of rapidly discovering related set pairs in collections of sets. Internally, SilkMoth creates a signature for each set, with the property that any other set which is related must match the signature. SilkMoth then uses these signatures to prune the search space, so that only sets matching the signatures are left as candidates. Finally, SilkMoth applies the maximum matching metric to the remaining candidates to verify which of them are truly related sets. An important property of SilkMoth is that it is guaranteed to output exactly the same related set pairings as the brute-force method, unlike approximate techniques. Thus, a contribution of this paper is the characterization of the space of signatures which enable this property. We show that selecting the optimal signature in this space is NP-complete, and based on insights from the characterization of the space, we propose two novel filters which help to prune the candidates further before verification. In addition, we introduce a simple optimization to the calculation of the maximum matching metric itself, based on the triangle inequality. Compared to related approaches, SilkMoth is much more general, handling a larger space of similarity functions and relatedness metrics, and is an order of magnitude more efficient on real datasets.
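The filter-then-verify idea can be illustrated with a deliberately naive sketch. This is not SilkMoth's signature scheme, which the abstract does not detail. It assumes token-level Jaccard as the element similarity and a relatedness threshold `delta > 0` on the raw matching score, and it uses the weakest possible signature: every token of the set. Two sets that share no token have matching score 0, so skipping them cannot change the output. All function names are hypothetical, and the permutation-based verifier is only practical for tiny sets.

```python
from collections import defaultdict
from itertools import combinations, permutations


def jaccard(a: str, b: str) -> float:
    """Assumed element similarity: Jaccard overlap of whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def matching_score(r, s) -> float:
    """Exact maximum matching score by enumeration (tiny sets only)."""
    small, large = sorted((list(r), list(s)), key=len)
    if not small:
        return 0.0
    return max(
        sum(jaccard(a, b) for a, b in zip(small, perm))
        for perm in permutations(large, len(small))
    )


def related_pairs_brute_force(collection, delta):
    """Verify every pair of sets in the collection."""
    return sorted(
        (i, j)
        for i, j in combinations(range(len(collection)), 2)
        if matching_score(collection[i], collection[j]) >= delta
    )


def related_pairs_filtered(collection, delta):
    """Verify only pairs that share at least one token (naive signature)."""
    if delta <= 0:
        raise ValueError("the token filter is only exact for delta > 0")
    index = defaultdict(set)  # token -> ids of sets containing it
    for sid, elements in enumerate(collection):
        for element in elements:
            for token in element.split():
                index[token].add(sid)
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))
    related = sorted(
        (i, j)
        for i, j in candidates
        if matching_score(collection[i], collection[j]) >= delta
    )
    return related, len(candidates)


if __name__ == "__main__":
    columns = [
        ["77 Mass Ave", "Boston MA"],
        ["77 Mass Avenue", "Boston MA"],
        ["5th St", "Seattle WA"],
        ["Fifth St", "Seattle WA"],
    ]
    related, verified = related_pairs_filtered(columns, delta=1.2)
    print(related, "after verifying", verified, "of 6 pairs")
```

On this toy collection the filtered search returns the same pairs as the brute-force search while running the expensive verification on fewer pairs. A tighter signature would prune more candidates, at the cost of having to prove that no related set is missed.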
Subject
General Earth and Planetary Sciences, Water Science and Technology, Geography, Planning and Development
Cited by
15 articles.
1. Determining the Largest Overlap between Tables;Proceedings of the ACM on Management of Data;2024-03-12
2. R2D2: Reducing Redundancy and Duplication in Data Lakes;Proceedings of the ACM on Management of Data;2023-12-08
3. DeepJoin: Joinable Table Discovery with Pre-Trained Language Models;Proceedings of the VLDB Endowment;2023-06
4. Koios: Top-k Semantic Overlap Set Search;2023 IEEE 39th International Conference on Data Engineering (ICDE);2023-04
5. TokenJoin;Proceedings of the VLDB Endowment;2022-12