Handling data-skewness in character based string similarity join using Hadoop

Published:2020-08-04 Issue:1/2 Volume:18 Page:22-44
ISSN:2634-1964
Container-title:Applied Computing and Informatics
language:en
Short-container-title:ACI

Author:

Meena Kanak,Tayal Devendra K.,Castillo Oscar,Jain Amita

Abstract

The scalability of similarity joins is threatened by data skewness, a pervasive problem in scientific data. Skewness produces an uneven distribution of attributes, which can cause a severe load imbalance. When database join operations are applied to such datasets, skewness grows exponentially. All the algorithms developed to date for implementing database joins are highly skew sensitive. This paper presents a new approach for handling data skewness in a character-based string similarity join using the MapReduce framework. No prior work in the literature handles data skewness in character-based string similarity joins, although work on set-based string similarity joins exists. The proposed work is divided into three stages, and every stage is further divided into mapper and reducer phases, each dedicated to a specific task. The first stage finds the lengths of the strings in a dataset. The second stage uses the MR-Pass Join framework to generate valid candidate pairs. The third stage incorporates MRFA concepts into the string similarity join, named “MRFA-SSJ” (MapReduce Frequency Adaptive – String Similarity Join), and is further divided into four MapReduce phases. Hence, MRFA-SSJ is proposed to handle skewness in the string similarity join. The experiments were run on three datasets, namely DBLP, Query log, and a real dataset of IP addresses & Cookies, using the Hadoop framework on a 15-node cluster, and the Zipf distribution law was followed for the analysis of the skewness factor. The proposed algorithm was compared with three known algorithms: all three fail when data is highly skewed, whereas the proposed method handles highly skewed data without any problem. The existing techniques survived up to Zipf factor 0.5, whereas the proposed algorithm survives up to Zipf factor 1. Hence the proposed algorithm is skew insensitive and ensures scalability with a reasonable query processing time for string similarity database joins. It also ensures the even distribution of attributes.
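The load imbalance that the abstract attributes to skew can be illustrated with a small sketch. It is not the paper's MRFA-SSJ algorithm. It only models join keys whose frequencies follow a Zipf law with factor `s` and assigns them to 15 reducers, matching the cluster size mentioned above. The function names, the key count of 10,000, and the modulo assignment (a stand-in for a hash partitioner) are assumptions made for illustration.

```python
def zipf_weights(num_keys, s):
    """Relative frequency of the key at rank k is proportional to 1 / k**s."""
    raw = [1.0 / (rank ** s) for rank in range(1, num_keys + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def reducer_loads(num_keys, s, num_reducers):
    """Share of all records each reducer receives.

    Assumption: keys are assigned by rank modulo the reducer count,
    a simple stand-in for hash partitioning.
    """
    loads = [0.0] * num_reducers
    for rank, weight in enumerate(zipf_weights(num_keys, s), start=1):
        loads[rank % num_reducers] += weight
    return loads


def imbalance(num_keys, s, num_reducers):
    """Heaviest reducer load divided by the mean load (1.0 is perfectly balanced)."""
    loads = reducer_loads(num_keys, s, num_reducers)
    return max(loads) / (sum(loads) / num_reducers)


if __name__ == "__main__":
    # Hypothetical setup: 10,000 distinct keys spread over 15 reducers.
    for s in (0.0, 0.5, 1.0):
        print(f"Zipf factor {s}: imbalance = {imbalance(10_000, s, 15):.2f}")
```

In this toy model the imbalance ratio grows as the Zipf factor rises from 0 to 1, because the reducer that receives the most frequent key carries a growing share of the records. That is the kind of behaviour a frequency-adaptive scheme aims to counter.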

Publisher

Emerald

Subject

Computer Science Applications,Information Systems,Software

Reference 54 articles.

1. A survey of large-scale analytical query processing in MapReduce;VLDB J,2014

2. V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors;Proc. VLDB Endow,2012

3. M. Wang, T. Nie, D. Shen, Y. Kou, G. Yu, Intelligent similarity joins for big data integration, in: Web Information System and Application Conference (WISA), 10th, IEEE, 2013, pp. 383–388.

4. From data quality to big data quality;J. Database Manage,2015

5. L. Kolb, A. Thor, E. Rahm, Load balancing for mapreduce-based entity resolution, in: 28th International Conference on Data Engineering (ICDE), IEEE, 2012, pp. 618–629.

Cited by 4 articles.

1. Comparative Analysis of Skew-Join Strategies for Large-Scale Datasets with MapReduce and Spark;Applied Sciences;2022-06-28

2. MapReduce-Based Dynamic Partition Join with Shannon Entropy for Data Skewness;Scientific Programming;2021-11-24

3. Providing diagnosis on diabetes using cloud computing environment to the people living in rural areas of India;Journal of Ambient Intelligence and Humanized Computing;2021-04-01

4. Semi-Stream Similarity Join Processing in a Distributed Environment;IEEE Access;2020