Abstract
Apache Spark jobs are often characterized by processing huge data sets and, therefore, require runtimes in the range of minutes to hours. Thus, being able to predict the runtime of such jobs would be useful not only to know when a job will finish, but also for scheduling purposes, for estimating the monetary costs of a cloud deployment, or for determining an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than for standard database queries: cluster configuration and parameters have a significant performance impact, and jobs usually contain a lot of user-defined code, making it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps: first, a white-box model for predicting the cardinalities of the input RDDs of each operator is built from prior knowledge about the operator's behavior and from application parameters such as applied filters, number of iterations, etc. In the second step, a black-box model for each task is constructed by monitoring runtime metrics while varying allocated resources and input RDD cardinalities. We further show how to use this gray-box approach not only for predicting the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. Our methodology is validated by an experimental evaluation, which shows highly accurate prediction of the actual job runtime and a performance improvement when intermediate results can be reused.
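The two-step approach described in the abstract can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function and class names, the selectivity-based cardinality formula, and the trivial linear runtime model are all hypothetical stand-ins for the white-box and black-box components the paper describes.

```python
def estimate_cardinality(input_rows, filter_selectivity, iterations=1):
    """White-box step (sketch): predict an operator's input RDD
    cardinality from prior knowledge of application parameters,
    e.g. filter selectivity and iteration count."""
    return int(input_rows * filter_selectivity) * iterations


class TaskRuntimeModel:
    """Black-box step (sketch): learn runtime as a function of input
    cardinality and allocated resources from monitored metrics.
    A simple linear fit stands in for any learned regression model."""

    def __init__(self):
        self.intercept = 0.0
        self.coef_card = 0.0

    def fit(self, samples):
        # samples: list of (cardinality, cores, runtime_seconds)
        # observed while varying resources and input cardinalities
        self.intercept = min(r for _, _, r in samples)
        self.coef_card = sum(
            (r - self.intercept) / max(c, 1) for c, _, r in samples
        ) / len(samples)

    def predict(self, cardinality, cores):
        # idealized scaling: runtime grows with cardinality and
        # shrinks with the number of allocated cores
        return self.intercept + self.coef_card * cardinality / max(cores, 1)
```

A gray-box prediction would then chain the two: feed the white-box cardinality estimate for each operator into the black-box task model, e.g. `model.predict(estimate_cardinality(10_000, 0.3), cores=2)`.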
Funder
Deutsches Zentrum für Luft- und Raumfahrt
Publisher
Springer Science and Business Media LLC
Subject
Information Systems and Management,Hardware and Architecture,Information Systems,Software
Cited by
9 articles