Abstract
A Join-Project operation is a join followed by a duplicate-eliminating projection. It is used in a wide variety of applications, including entity matching, set analytics, and graph analytics. Previous work proposes a hybrid design that uses the classical solution (i.e., join and deduplication) to process the sparse portion of the input data and MM (matrix multiplication) to process the dense portion. However, we observe three problems in the state-of-the-art solution: 1) the outputs of the sparse and dense portions overlap, requiring an extra deduplication step; 2) its table-to-matrix transformation makes an oversimplified assumption about the attribute values; and 3) there is a mismatch between the MM routines employed from BLAS packages and the characteristics of the Join-Project operation.
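To make the two building blocks concrete, the following is a minimal sketch of a Join-Project over two binary relations R(x, y) and S(y, z), computed once with the classical approach (hash join, then deduplication) and once as a Boolean matrix product. The relation shapes, the function names, and the bitset encoding are assumptions made for illustration; this is not the implementation from the paper or from prior work.

```python
from collections import defaultdict

def join_project_classical(R, S):
    """Project (x, z) from R(x, y) JOIN S(y, z): hash join, then dedup via a set."""
    s_by_y = defaultdict(list)
    for y, z in S:
        s_by_y[y].append(z)
    out = set()  # the set performs the duplicate elimination
    for x, y in R:
        for z in s_by_y.get(y, ()):
            out.add((x, z))
    return out

def join_project_boolean_mm(R, S):
    """Same result as a Boolean matrix product A (x by y) times B (y by z).

    Each matrix row is stored as a Python int used as a bitset (an assumption
    of this sketch; a real system would call an optimized MM kernel).
    """
    # Map attribute values to dense natural numbers 0, 1, 2, ...
    y_id, z_id = {}, {}
    for y in [y for _, y in R] + [y for y, _ in S]:
        y_id.setdefault(y, len(y_id))
    for _, z in S:
        z_id.setdefault(z, len(z_id))
    zs = list(z_id)  # position j holds the z value whose id is j

    a_rows = defaultdict(int)      # a_rows[x] has bit k set iff (x, y_k) in R
    for x, y in R:
        a_rows[x] |= 1 << y_id[y]
    b_rows = [0] * len(y_id)       # b_rows[k] has bit j set iff (y_k, z_j) in S
    for y, z in S:
        b_rows[y_id[y]] |= 1 << z_id[z]

    out = set()
    for x, bits in a_rows.items():
        acc, k = 0, 0
        while bits:                # OR together the B rows selected by row x of A
            if bits & 1:
                acc |= b_rows[k]
            bits >>= 1
            k += 1
        for j, z in enumerate(zs):
            if (acc >> j) & 1:
                out.add((x, z))
    return out

R = [(1, "a"), (1, "b"), (2, "b"), (3, "c")]
S = [("a", 10), ("b", 10), ("b", 20), ("d", 30)]
print(sorted(join_project_classical(R, S)))
print(sorted(join_project_boolean_mm(R, S)))
```

In the classical version the pair (1, 10) is produced twice (through y = "a" and y = "b") and collapsed by the set. In the matrix version the same collapse happens implicitly, because a Boolean product records only whether at least one witness y exists.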
In this paper, we propose DIM³, an optimized algorithm for the Join-Project operation. To address 1), we propose an intersection-free partition method to completely remove the final deduplication step. For 2), we develop an optimized design for mapping attribute values to natural numbers. For 3), we propose DenseEC and SparseBMM algorithms to exploit the structure of Join-Project for better efficiency. Moreover, we extend DIM³ to consider partial result caching and support Join-op queries, including Join-Aggregate and MJP (Multi-way Joins with Projection). Experimental results using both real-world and synthetic data sets show that DIM³ outperforms previous Join-Project solutions by a factor of 2.3X-18X. Compared to RDBMSs, DIM³ achieves orders of magnitude speedups.
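The following sketch illustrates why the choice of partitioning attribute decides whether the dense and sparse outputs can overlap. It assumes relations R(x, y) and S(y, z) with output (x, z), and it splits R by a simple degree threshold. The splitting rules, the threshold, and the function names are hypothetical choices for illustration; the abstract does not specify the actual partition method, which may differ.

```python
from collections import defaultdict

def join_project(R, S):
    """Project (x, z) from R(x, y) JOIN S(y, z), with duplicates eliminated."""
    s_by_y = defaultdict(set)
    for y, z in S:
        s_by_y[y].add(z)
    return {(x, z) for x, y in R for z in s_by_y.get(y, ())}

def split_by_degree(R, key_index, threshold):
    """Split R into (dense, sparse) by how often the chosen attribute occurs."""
    degree = defaultdict(int)
    for t in R:
        degree[t[key_index]] += 1
    dense = [t for t in R if degree[t[key_index]] >= threshold]
    sparse = [t for t in R if degree[t[key_index]] < threshold]
    return dense, sparse

R = [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "d"), (3, "a")]
S = [("a", 10), ("b", 10), ("c", 20), ("d", 10)]

# Split on the output attribute x (index 0): every output pair (x, z) carries
# its x, and each x falls in exactly one part, so the two outputs are disjoint.
dense_x, sparse_x = split_by_degree(R, 0, 3)
out_dense_x = join_project(dense_x, S)
out_sparse_x = join_project(sparse_x, S)
print(out_dense_x & out_sparse_x)   # empty: no final deduplication needed

# Split on the join attribute y (index 1), shown only for contrast: the same
# (x, z) can be reached through a dense y and a sparse y, so the outputs overlap.
dense_y, sparse_y = split_by_degree(R, 1, 3)
out_dense_y = join_project(dense_y, S)
out_sparse_y = join_project(sparse_y, S)
print(out_dense_y & out_sparse_y)   # non-empty: a merge step must deduplicate
```

With the x-based split, the union of the two partial results is already the final answer. With the y-based split, the union must be deduplicated again.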
Publisher
Association for Computing Machinery (ACM)
Subject
General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development
Cited by
5 articles.