A Trie Based Set Similarity Query Algorithm

Published: 2023-01-02 | Volume: 11 | Issue: 1 | Page: 229
ISSN: 2227-7390
Journal: Mathematics
Language: en

Author:

Jia Lianyin, Tang Junzhuo, Li Mengjuan, Li Runxin, Ding Jiaman, Chen Yinong

Abstract

Set similarity query is a primitive for many applications, such as data integration, data cleaning, and gene sequence alignment. Most existing algorithms are based on inverted indexes. They usually filter unqualified sets one by one and lack sufficient support for duplicated sets, which leads to low efficiency. To solve this problem, this paper designs T-starTrie, an efficient trie-based index for set similarity query. T-starTrie naturally groups sets with the same prefix into one node and can filter all sets corresponding to that node at once, thereby significantly improving the efficiency of candidate generation. We find that the set similarity query problem can be transformed into the problem of detecting first-layer matching nodes (FLMNodes) on T-starTrie, and we design an efficient FLMNode detection algorithm accordingly. Based on this, an efficient set similarity query algorithm, TT-SSQ, is implemented by developing a variety of filtering techniques. Experimental results show that TT-SSQ can be up to 3.10x faster than existing algorithms.
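The prefix-grouping idea in the abstract can be illustrated with a small sketch. This is not T-starTrie or TT-SSQ, whose details the abstract does not give. It is a plain trie over sorted sets, assuming Jaccard similarity as the measure (the abstract does not name one). The names `TrieNode`, `build_trie`, and `jaccard_query` are hypothetical. The sketch shows two points from the abstract: duplicated sets end at the same node, and one bound check on a node can discard every set stored below it.

```python
from bisect import bisect_right


class TrieNode:
    def __init__(self):
        self.children = {}  # element -> TrieNode
        self.set_ids = []   # ids of all sets that end exactly at this node


def build_trie(sets):
    """Insert each set as a sorted path; duplicated sets share one end node."""
    root = TrieNode()
    for set_id, s in enumerate(sets):
        node = root
        for elem in sorted(set(s)):
            node = node.children.setdefault(elem, TrieNode())
        node.set_ids.append(set_id)
    return root


def jaccard_query(root, query, threshold):
    """Return ids of sets whose Jaccard similarity to query is >= threshold.

    Assumes 0 < threshold <= 1. This is an illustrative pruning rule,
    not the filtering techniques of the paper.
    """
    q = sorted(set(query))
    q_set = set(q)
    results = []

    def visit(node, depth, overlap):
        # A set ending here has exactly `depth` elements, `overlap` of them in q.
        if node.set_ids:
            union = len(q) + depth - overlap
            if union > 0 and overlap / union >= threshold:
                results.extend(node.set_ids)
        for elem, child in node.children.items():
            new_depth = depth + 1
            new_overlap = overlap + (elem in q_set)
            # Paths are sorted, so only query elements greater than elem
            # can still be matched deeper in this subtree.
            remaining = len(q) - bisect_right(q, elem)
            # Upper bound on the Jaccard similarity of any set below child.
            best = (new_overlap + remaining) / (len(q) + new_depth - new_overlap)
            if best >= threshold:
                visit(child, new_depth, new_overlap)
            # Otherwise the whole subtree is skipped with a single check.

    visit(root, 0, 0)
    return sorted(results)


sets = [[1, 2, 3], [3, 2, 1], [1, 2, 4], [5, 6, 7], [1, 2]]
trie = build_trie(sets)
print(jaccard_query(trie, [1, 2, 3], 0.6))
```

In this toy data, sets 0 and 1 are duplicates and share one end node, so they are accepted together, while the subtree starting at element 5 is rejected without visiting the set stored under it.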

Funder

National Natural Science Foundation of China

Publisher

MDPI AG

Subject

General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)

Link

https://www.mdpi.com/2227-7390/11/1/229/pdf

Cited by 1 article.

1. Efficient List Intersection Algorithm for Short Documents by Document Reordering; Mathematics; 2024-04-26