
TokenJoin

Published:2022-12 Issue:4 Volume:16 Page:790-802
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Zeakis Alexandros¹,Skoutas Dimitrios²,Sacharidis Dimitris³,Papapetrou Odysseas⁴,Koubarakis Manolis⁵

Affiliation:

1. National and Kapodistrian University of Athens & "Athena" RC, Greece

2. "Athena" RC, Greece

3. Université Libre de Bruxelles, Belgium

4. Eindhoven University of Technology, Netherlands

5. National and Kapodistrian University of Athens, Greece

Abstract

Set similarity join is an important problem with many applications in data discovery, cleaning and integration. To increase robustness, fuzzy set similarity join calculates the similarity of two sets based on maximum weighted bipartite matching instead of set overlap. This allows pairs of elements, represented as sets or strings, to match approximately rather than exactly, e.g., based on Jaccard similarity or edit distance. However, this significantly increases the verification cost, which makes efficient and effective filtering techniques for reducing the number of candidate pairs even more important. The current state-of-the-art algorithm relies on similarity computations between pairs of elements to filter candidates. In this paper, we propose token-based instead of element-based filtering, showing that it is significantly more lightweight while offering similar or even better pruning effectiveness. Moreover, we address the top-k variant of the problem, alleviating the need for a user-specified similarity threshold. We also propose early termination to reduce the cost of verification. Our experimental results on six real-world datasets show that our approach always outperforms the state of the art, being an order of magnitude faster on average.
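To make the notion of fuzzy set similarity concrete, the following is a minimal sketch in Python, not the paper's algorithm. It assumes each element is a set of tokens, that element pairs are scored with Jaccard similarity, and that the matching weight M is normalized as M / (|R| + |S| - M), a common fuzzy Jaccard formulation that the abstract itself does not spell out. The function names are hypothetical, and the matching is found by brute force, which is only practical for very small sets.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_matching_weight(R, S, sim=jaccard):
    """Weight of a maximum weighted bipartite matching between R and S.

    Brute force over all injective assignments of the smaller side into
    the larger side. Assumes `sim` is symmetric, so swapping sides is safe.
    """
    if len(R) > len(S):
        R, S = S, R
    best = 0.0
    for perm in permutations(range(len(S)), len(R)):
        total = sum(sim(R[i], S[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best


def fuzzy_jaccard(R, S, sim=jaccard):
    """Fuzzy Jaccard (assumed normalization): M / (|R| + |S| - M)."""
    if not R and not S:
        return 1.0
    m = max_matching_weight(R, S, sim)
    return m / (len(R) + len(S) - m)


# Illustrative data: each element is a small set of tokens.
R = [{"set", "similarity", "join"}, {"fuzzy", "matching"}]
S = [{"set", "similarity", "joins"}, {"fuzzy", "match"}, {"token"}]

# Plain set overlap would find no identical elements here, while the
# matching-based score is roughly 0.2 because elements match approximately.
print(fuzzy_jaccard(R, S))
```

The brute-force search illustrates why verification is costly: every candidate pair requires solving a matching problem over element similarities. For larger sets, a polynomial-time assignment solver such as SciPy's `linear_sum_assignment` could replace the permutation loop.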

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3574245.3574263

Reference: 34 articles.

1. Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient Exact Set-Similarity Joins. In VLDB. 918--929.

2. Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. 2007. Scaling up all pairs similarity search. In WWW. 131--140.

3. Spatio-textual similarity joins

4. Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. 2006. A Primitive Operator for Similarity Joins in Data Cleaning. In ICDE. 5.

5. Tobias Christiani, Rasmus Pagh, and Johan Sivertsen. 2018. Scalable and Robust Set Similarity Join. In ICDE. 1240--1243.

Cited by 1 article.

1. BipartiteJoin: Optimal Similarity Join for Fuzzy Bipartite Matching;Lecture Notes in Networks and Systems;2024