LSH SimilarityJoin Pattern in FastFlow

Published: 2024-05-23 Issue: 3 Volume: 52 Page: 207-230
ISSN: 0885-7458
Container-title: International Journal of Parallel Programming
Language: en
Short-container-title: Int J Parallel Prog

Author:

Tonci Nicolò,Rivault Sébastien,Bamha Mostafa,Robert Sophie,Limet Sébastien,Torquati Massimo

Abstract

Similarity joins are recognized to be among the most used data processing and analysis operations. We introduce a C++-based high-level parallel pattern implemented on top of FastFlow Building Blocks to provide the programmer with ready-to-use similarity join computations. The SimilarityJoin pattern is implemented according to the MapReduce paradigm enriched with locality sensitive hashing (LSH) to optimize the whole computation. The new parallel pattern can be used with any C++ serializable data structure and executed on shared- and distributed-memory machines. We present experimental validations of the proposed solution considering two different clusters and small and large input datasets to evaluate in-core and out-of-core executions. The performance assessment of the SimilarityJoin pattern has been conducted by comparing the execution time against the one obtained from the original hand-tuned Hadoop-based implementation of the LSH-based similarity join algorithms as well as a Spark-based version. The experiments show that the SimilarityJoin pattern: (1) offers a significant performance improvement for small and medium datasets; (2) is competitive also for computations using large input datasets producing out-of-core executions.
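
To make the idea in the abstract concrete, the following is a minimal, sequential C++ sketch of an LSH-based similarity join arranged as a map step followed by a reduce step. It is not the FastFlow SimilarityJoin pattern and uses no FastFlow API. Everything in it is an assumption made for illustration: records are modeled as sets of integer tokens, similarity is Jaccard, the LSH family is MinHash with banding, and the names `Record`, `mix`, `minhash`, and `lsh_similarity_join` are hypothetical. The abstract does not say which LSH family or similarity measure the paper uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical record model: a set of integer token ids.
using Record = std::set<std::uint32_t>;
using Pair = std::pair<std::size_t, std::size_t>;  // (i, j) with i < j

// Exact Jaccard similarity, used to verify candidate pairs.
double jaccard(const Record& a, const Record& b) {
    if (a.empty() && b.empty()) return 1.0;
    std::size_t inter = 0;
    for (std::uint32_t t : a) inter += b.count(t);
    return static_cast<double>(inter) /
           static_cast<double>(a.size() + b.size() - inter);
}

// 64-bit mixing function (splitmix64 finalizer) used as a cheap hash.
std::uint64_t mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// MinHash signature: for each of k seeded hash functions, keep the minimum
// hash value over the tokens of the record.
std::vector<std::uint64_t> minhash(const Record& r, std::size_t k) {
    std::vector<std::uint64_t> sig(k, std::numeric_limits<std::uint64_t>::max());
    for (std::uint32_t t : r)
        for (std::size_t h = 0; h < k; ++h)
            sig[h] = std::min(sig[h], mix(t ^ mix(h)));
    return sig;
}

// Map: emit ((band index, band hash), record id) for every band of the signature.
// Reduce: within each bucket, verify all pairs with the exact similarity.
std::set<Pair> lsh_similarity_join(const std::vector<Record>& data,
                                   double threshold,
                                   std::size_t bands, std::size_t rows) {
    assert(bands > 0 && rows > 0);
    std::map<std::pair<std::size_t, std::uint64_t>, std::vector<std::size_t>> buckets;
    for (std::size_t id = 0; id < data.size(); ++id) {
        const std::vector<std::uint64_t> sig = minhash(data[id], bands * rows);
        for (std::size_t b = 0; b < bands; ++b) {
            std::uint64_t key = 0;
            for (std::size_t r = 0; r < rows; ++r)
                key = mix(key ^ sig[b * rows + r]);
            buckets[{b, key}].push_back(id);
        }
    }
    std::set<Pair> result;  // a set removes pairs found in several buckets
    for (const auto& bucket : buckets) {
        const std::vector<std::size_t>& ids = bucket.second;  // ascending ids
        for (std::size_t i = 0; i < ids.size(); ++i)
            for (std::size_t j = i + 1; j < ids.size(); ++j)
                if (jaccard(data[ids[i]], data[ids[j]]) >= threshold)
                    result.insert({ids[i], ids[j]});
    }
    return result;
}
```

Calling `lsh_similarity_join(data, 0.8, 4, 2)` on a `std::vector<Record>` returns the index pairs whose Jaccard similarity is at least 0.8 among the pairs that share at least one LSH bucket. Because only bucket-mates are compared, the result never contains a false pair, but a truly similar pair can be missed; more bands or fewer rows per band reduce that risk at the cost of more candidate checks. In a distributed MapReduce setting the bucket key would be the shuffle key, which is what lets each reducer work on its buckets independently.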

Funder

Università di Pisa

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s10766-024-00772-1.pdf

Reference: 50 articles.

1. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: 22nd International Conference on Data Engineering (2006)

2. Dey, D., Sarkar, S., De, P.: A distance-based approach to entity reconciliation in heterogeneous databases. IEEE Trans. Knowl. Data Eng. 14(3), 567–582 (2002)

3. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the 16th International Conference on World Wide Web, pp. 131–140 (2007)

4. Shang, Y., Li, Z., Qu, W., Xu, Y., Song, Z., Zhou, X.: Scalable collaborative filtering recommendation algorithm with mapreduce. In: 2014 IEEE 12th International Conference on Dependable, Autonomic and Secure Computing, pp. 103–108 (2014)

5. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)