A Multi-Input File Data Symmetry Placement Method Considering Job Execution Frequency for MapReduce Join Operation

Published:2022-12-15 Issue:15 Volume:36 Page:
ISSN:0218-0014
Container-title:International Journal of Pattern Recognition and Artificial Intelligence
language:en
Short-container-title:Int. J. Patt. Recogn. Artif. Intell.

Author:

Wu Jia-Xuan¹, Zhang Yu-Zhu¹, Jiang Yue-Qiu¹, Zhang Xin²

Affiliation:

1. School of Information Science and Engineering, Shenyang Ligong University, Shenyang 110819, P. R. China

2. School of Automobile and Traffic, Shenyang Ligong University, Shenyang 110819, P. R. China

Abstract

In recent years, data-parallel computing frameworks such as Hadoop have become increasingly popular among scientists, and data-grouping-aware placement of multiple input files in Hadoop has attracted growing attention. However, many data-grouping-aware data placement schemes for multiple input files do not take MapReduce job execution frequency into account. Our study indicates that such schemes increase data transmission between nodes. The starting point of this paper is that if a certain type of MapReduce job has been executed frequently in the recent past, it is likely to be executed again later. Based on this assumption, we propose a data-grouping-aware multiple-input-file data symmetry placement method based on MapReduce job execution frequency (DGAMF). Using the history of MapReduce job executions, the method first creates an inter-block join access correlation model, then divides the correlated blocks into groups according to this model, and finally gives a mathematical model for data placement. The model can be used to guide the centralized placement of data blocks, addressing the node load balancing issue caused by data asymmetry. With the proposed method, correlated blocks from the same group are placed in the same set of nodes, which effectively reduces the amount of data transmitted between nodes. We validated the method in an experimental Hadoop environment. The experimental results show that the proposed method processes massive datasets effectively and significantly improves MapReduce efficiency.
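The three steps named in the abstract (correlation model, grouping, placement) can be illustrated with a toy sketch. This is not the authors' DGAMF model, whose formulas the abstract does not give. It assumes that correlation between two blocks is the summed execution frequency of the join jobs that read both, that groups are the connected components of pairs at or above a threshold, and that each group goes to a single least-loaded node. All function names, the `threshold` parameter, and the sample block and node identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def correlation_from_history(job_history):
    """Sum job frequencies over every pair of blocks read by the same join job.

    job_history: list of (frequency, [block ids]) tuples (assumed input shape).
    """
    corr = defaultdict(int)
    for freq, blocks in job_history:
        for a, b in combinations(sorted(set(blocks)), 2):
            corr[(a, b)] += freq
    return dict(corr)


def group_blocks(corr, threshold):
    """Group blocks with union-find, merging pairs whose weight >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), weight in corr.items():
        root_a, root_b = find(a), find(b)
        if weight >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for block in list(parent):
        groups[find(block)].append(block)
    return sorted(sorted(g) for g in groups.values())


def place_groups(groups, nodes):
    """Greedy placement: largest group first, onto the least-loaded node."""
    load = {n: 0 for n in nodes}
    placement = {}
    for group in sorted(groups, key=len, reverse=True):
        target = min(nodes, key=lambda n: load[n])
        for block in group:
            placement[block] = target
        load[target] += len(group)
    return placement


# Hypothetical history: (execution frequency, blocks joined by that job type).
history = [(5, ["A1", "B1"]), (3, ["A2", "B2"]), (1, ["A1", "B2"])]
corr = correlation_from_history(history)
groups = group_blocks(corr, threshold=2)
placement = place_groups(groups, ["node1", "node2"])
print(groups)
print(placement)
```

Frequently joined blocks end up on the same node, while the rarely executed join between `A1` and `B2` falls below the threshold and does not force the two groups together. A real deployment would also have to account for HDFS replication and block sizes, which this sketch ignores.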

Funder

the General Young Talents Project for Scientific Research grant of the Educational Department of Liaoning Province

the Research Support Program for Inviting High-Level Talents grant of Shenyang Ligong University

Publisher

World Scientific Pub Co Pte Ltd

Subject

Artificial Intelligence, Computer Vision and Pattern Recognition, Software

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0218001422590376

Reference: 32 articles

1. MRA++: Scheduling and data placement on MapReduce for heterogeneous environments

2. UnifyDR: A Generic Framework for Unifying Data and Replica Placement