Balance-aware distributed string similarity-based query processing system

Published:2019-05 Issue:9 Volume:12 Page:961-974
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Sun Ji¹, Shang Zeyuan², Li Guoliang¹, Deng Dong², Bao Zhifeng³

Affiliation:

1. Tsinghua University

2. Tsinghua University and MIT

3. RMIT University

Abstract

Data analysts spend more than 80% of their time on data cleaning and integration in the whole process of data analytics because of data errors and inconsistencies. Similarity-based query processing is an important way to tolerate these errors and inconsistencies. However, similarity-based query processing is costly, and traditional databases cannot meet such an expensive requirement. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports four core similarity operations: similarity selection, similarity join, top-k selection, and top-k join. Dima extends SQL so that users can easily invoke these similarity-based operations in their data analysis tasks. To avoid expensive data transmission in a distributed environment, we propose balance-aware signatures, where two records are similar only if they share common signatures, and we can adaptively select the signatures to balance the workload. Dima builds signature-based global indexes and local indexes to support similarity operations. Since Spark is one of the most widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques in Spark. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support complex similarity-based query processing on large-scale datasets. We have conducted extensive experiments on four real-world datasets. Experimental results show that Dima outperforms state-of-the-art approaches by 1 to 3 orders of magnitude and has good scalability.
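
The signature idea in the abstract (records can only be similar if they share a signature) can be illustrated with a minimal single-machine sketch. This is not Dima's balance-aware scheme: it assumes Jaccard similarity over whitespace tokens and swaps in the classic prefix-filter signature, and every function name here is hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Split a string into a set of lowercase whitespace tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prefix_signatures(token_set, threshold):
    """Prefix-filter signatures (an assumption, not Dima's signatures).

    Tokens are sorted by one fixed global order (lexicographic here).
    If two sets have Jaccard >= threshold, their prefixes of length
    n - ceil(threshold * n) + 1 share at least one token.
    """
    ordered = sorted(token_set)
    n = len(ordered)
    if n == 0:
        return []
    # Small epsilon guards against floating-point round-up in ceil.
    prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
    return ordered[:prefix_len]


def similarity_join(records, threshold):
    """Self-join: return index pairs (i, j), i < j, with Jaccard >= threshold."""
    token_sets = [tokens(r) for r in records]

    # Inverted index from signature to the records that carry it.
    index = defaultdict(list)
    for rid, ts in enumerate(token_sets):
        for sig in prefix_signatures(ts, threshold):
            index[sig].append(rid)

    # Only records sharing a signature become candidate pairs.
    candidates = set()
    for rids in index.values():
        candidates.update(combinations(rids, 2))

    # Verify each candidate with the exact similarity function.
    return sorted(
        (i, j)
        for i, j in candidates
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    )
```

In a distributed setting, the signature would also serve as the partitioning key, so that only records sharing a signature need to meet on the same worker; how signatures are chosen to balance that workload is the paper's contribution and is not modeled here.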

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3329772.3329774

Cited by 9 articles.

1. Reasoning on property graphs with graph generating dependencies;Information Sciences;2024-06

2. Resource Allocation in Cloud Computing Using Genetic Algorithm and Neural Network;2023 IEEE 8th International Conference on Smart Cloud (SmartCloud);2023-09-16

3. Learned Cardinality Estimation for Similarity Queries;Proceedings of the 2021 International Conference on Management of Data;2021-06-09

4. Blocking and Filtering Techniques for Entity Resolution;ACM Computing Surveys;2021-03-31

5. Internal and external memory set containment join;The VLDB Journal;2021-02-23