
CoPart: a context-based partitioning technique for big data

Published:2021-01-19 Issue:1 Volume:8
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Migliorini Sara,Belussi Alberto,Quintarelli Elisa,Carra Damiano

Abstract

The MapReduce programming paradigm is frequently used to process and analyse huge amounts of data. This paradigm relies on the ability to apply the same operation in parallel on independent chunks of data. As a consequence, overall performance depends greatly on the way data are partitioned among the various computation nodes. The default partitioning technique, provided by systems like Hadoop or Spark, basically performs a random subdivision of the input records, without considering their nature or the correlations between them. Even if such an approach can be appropriate in the simplest case, where all the input records always have to be analysed, it becomes a limitation for sophisticated analyses, in which correlations between records can be exploited to preliminarily prune unnecessary computations. In this paper we design a context-based multi-dimensional partitioning technique, called CoPart, which takes data correlation into account in order to determine how records are subdivided between splits (i.e., units of work assigned to a computation node). More specifically, it considers not only the correlation of data w.r.t. contextual attributes, but also the distribution of each contextual dimension in the dataset. We experimentally compare our approach with existing ones, considering both quality criteria and query execution times.
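The pruning argument in the abstract can be illustrated with a minimal sketch. It is not the CoPart algorithm: it only contrasts a random subdivision with a simple range partitioning on a single contextual attribute, and counts how many splits a range query would have to read. The record layout, the function names, and the query bounds are all hypothetical and chosen for illustration.

```python
import random


def random_partition(records, num_splits, seed=0):
    """Assign each record to a random split (a stand-in for default partitioning)."""
    rng = random.Random(seed)
    splits = [[] for _ in range(num_splits)]
    for rec in records:
        splits[rng.randrange(num_splits)].append(rec)
    return splits


def range_partition(records, num_splits, key):
    """Sort on one contextual attribute and cut into equally sized splits."""
    ordered = sorted(records, key=key)
    size = -(-len(ordered) // num_splits)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def splits_to_read(splits, key, low, high):
    """Count splits whose [min, max] range on `key` overlaps [low, high]."""
    count = 0
    for split in splits:
        if not split:
            continue
        values = [key(r) for r in split]
        if min(values) <= high and max(values) >= low:
            count += 1
    return count


def by_time(record):
    """Contextual attribute used for partitioning (assumed: the timestamp)."""
    return record[0]


# Hypothetical records: (timestamp, measured value)
records = [(t, 20 + (t % 7)) for t in range(1000)]

random_splits = random_partition(records, 10)
range_splits = range_partition(records, 10, by_time)

# Query: all records with a timestamp in [100, 199]
random_read = splits_to_read(random_splits, by_time, 100, 199)
range_read = splits_to_read(range_splits, by_time, 100, 199)
print("splits read with random partitioning:", random_read)
print("splits read with range partitioning:", range_read)
```

With range partitioning the query touches a single split, whereas random partitioning scatters matching timestamps across the splits, so few or none can be skipped. A multi-dimensional technique such as the one described in the abstract would additionally have to balance several contextual attributes and their distributions, which this sketch does not attempt.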

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

http://link.springer.com/content/pdf/10.1186/s40537-021-00410-4.pdf

Reference 31 articles.

1. White T. Hadoop: the definitive guide. 4th edn. O’Reilly Media, Inc.; 2015.

2. Chambers B, Zaharia M. Spark: the definitive guide: big data processing made simple. 1st edn. O’Reilly Media, Inc.; 2018.

3. Alarabi L, Mokbel MF, Musleh M. ST-Hadoop: a MapReduce framework for spatio-temporal data. GeoInformatica. 2018;22(4):785–813.

4. Bakli M, Sakr M, Soliman TH. HadoopTrajectory: a Hadoop spatiotemporal data processing extension. J Geogr Syst. 2019;21(2):211–35.

5. Beck M, Hao W, Campan A. Accelerating the mobile cloud: using Amazon Mobile Analytics and k-means clustering. In: 2017 IEEE 7th annual computing and communication workshop and conference (CCWC); 2017. p. 1–7.

Cited by 4 articles.

1. Block size estimation for data partitioning in HPC applications using machine learning techniques;Journal of Big Data;2024-01-16

2. Big Data Management System Architectures: From Opportunities to Challenges [Vision];2023 IEEE International Conference on Big Data (BigData);2023-12-15

3. Tracking social provenance in chains of retweets;Knowledge and Information Systems;2023-05-09

4. Time-Aware Data Partition Optimization and Heterogeneous Task Scheduling Strategies in Spark Clusters;The Computer Journal;2023-03-14