Performance models of data parallel DAG workflows for large scale data analytics

Published: 2023-05-23 Issue: 3 Volume: 41 Page: 299-329
ISSN: 0926-8782
Container-title: Distributed and Parallel Databases
Language: en
Short-container-title: Distrib Parallel Databases

Author:

Shi Juwei, Lu Jiaheng

Abstract

Directed Acyclic Graph (DAG) workflows are widely used for large-scale data analytics in cluster-based distributed computing systems. Building a performance model for a DAG on data-parallel frameworks (e.g., MapReduce) is a research challenge because the allocation of preemptable system resources among parallel jobs may vary dynamically during execution. This variation in resource allocation makes it difficult to estimate the execution time accurately. In this paper, we tackle this challenge by proposing a new cost model, called Bottleneck Oriented Estimation (BOE), which estimates the allocation of preemptable resources by identifying the bottleneck and thereby accurately predicts task execution time. For a DAG workflow, we propose a state-based approach that iteratively uses the resource allocation property among stages to estimate the overall execution plan. Furthermore, to handle the skewness of various jobs, we refine the model with order statistics theory to improve estimation accuracy. Extensive experiments were performed to validate these cost models with HiBench and TPC-H workloads. The BOE model outperforms the state-of-the-art models by a factor of five for task execution time estimation. For the refined skew-aware model, the average prediction error is under 3% when estimating the execution time of 51 hybrid analytics (HiBench) and query (TPC-H) DAG workflows.

Funder

University of Helsinki including Helsinki University Central Hospital

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management, Hardware and Architecture, Information Systems, Software

Link

https://link.springer.com/content/pdf/10.1007/s10619-023-07425-1.pdf

References (40 articles)

1. Project Hydrogen: unifying state-of-the-art AI and big data in Apache Spark. https://databricks.com/session/databricks-keynote-2

2. Apache Tez. https://tez.apache.org/

3. Assefi, M., Behravesh, E., Liu, G., Tafti, A.P.: Big data machine learning using Apache Spark MLlib. In: Proceedings of the IEEE International Conference on Big Data (Big Data), pp. 3492–3498. IEEE (2017)

4. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper. Syst. Rev. 41, 59–72 (2007)

5. Lim, H., Herodotou, H., Babu, S.: Stubby: a transformation-based optimizer for MapReduce workflows. VLDB 5(11), 1196–1207 (2012)