Abstract
Apache Spark jobs are often characterized by processing huge data sets and, therefore, require runtimes in the range of minutes to hours. Thus, being able to predict the runtime of such jobs would be useful not only to know when a job will finish, but also for scheduling purposes, for estimating the monetary costs of a cloud deployment, or for determining an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than for standard database queries: cluster configuration and parameters have a significant performance impact, and jobs usually contain a lot of user-defined code, making it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps: first, a white-box model for predicting the cardinalities of the input RDDs of each operator is built from prior knowledge about the operator's behavior and from application parameters such as applied filters, number of iterations, etc. In the second step, a black-box model for each task is constructed by monitoring runtime metrics while varying allocated resources and input RDD cardinalities. We further show how to use this gray-box approach not only for predicting the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. Our methodology is validated by an experimental evaluation, which shows highly accurate prediction of the actual job runtime and a performance improvement when intermediate results can be reused.
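The two-step approach described in the abstract can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function and class names, the selectivity-based cardinality formula, and the trivial linear runtime model are all hypothetical stand-ins for the white-box and black-box components the paper describes.

```python
def estimate_cardinality(input_rows, filter_selectivity, iterations=1):
    """White-box step (sketch): predict an operator's input RDD
    cardinality from prior knowledge of application parameters,
    e.g. filter selectivity and iteration count."""
    return int(input_rows * filter_selectivity) * iterations


class TaskRuntimeModel:
    """Black-box step (sketch): learn runtime as a function of input
    cardinality and allocated resources from monitored metrics.
    A simple linear fit stands in for any learned regression model."""

    def __init__(self):
        self.intercept = 0.0
        self.coef_card = 0.0

    def fit(self, samples):
        # samples: list of (cardinality, cores, runtime_seconds)
        # observed while varying resources and input cardinalities
        self.intercept = min(r for _, _, r in samples)
        self.coef_card = sum(
            (r - self.intercept) / max(c, 1) for c, _, r in samples
        ) / len(samples)

    def predict(self, cardinality, cores):
        # idealized scaling: runtime grows with cardinality and
        # shrinks with the number of allocated cores
        return self.intercept + self.coef_card * cardinality / max(cores, 1)
```

A gray-box prediction would then chain the two: feed the white-box cardinality estimate for each operator into the black-box task model, e.g. `model.predict(estimate_cardinality(10_000, 0.3), cores=2)`.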
Funder
Deutsches Zentrum für Luft- und Raumfahrt
Publisher
Springer Science and Business Media LLC
Subject
Information Systems and Management,Hardware and Architecture,Information Systems,Software
Cited by
9 articles