A gray-box modeling methodology for runtime prediction of Apache Spark jobs

Author:

Al-Sayeh Hani, Hagedorn Stefan, Sattler Kai-Uwe

Abstract

Apache Spark jobs often process huge data sets and therefore require runtimes in the range of minutes to hours. Being able to predict the runtime of such jobs is useful not only to know when a job will finish, but also for scheduling, for estimating the monetary cost of a cloud deployment, and for determining an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than predicting the runtime of standard database queries: the cluster configuration and parameters have a significant performance impact, and jobs usually contain a lot of user-defined code, which makes it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps. In the first step, a white-box model for predicting the cardinalities of the input RDDs of each operator is built from prior knowledge about the application's behavior and its parameters, such as the applied filters and the number of iterations. In the second step, a black-box model for each task is constructed by monitoring runtime metrics while varying the allocated resources and the input RDD cardinalities. We further show how to use this gray-box approach not only for predicting the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. Our methodology is validated by an experimental evaluation that shows a highly accurate prediction of the actual job runtime and a performance improvement when intermediate results can be reused.
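The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Stage` class, the helper names, the linear runtime model (runtime = a + b * rows / cores), the filter selectivity of 0.1, and all sample measurements are assumptions made up for illustration. The white-box part is a hand-written function that maps the job's input size to each stage's input cardinality; the black-box part is a least-squares fit over monitored (cardinality, cores, runtime) samples.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Closed-form least squares for y = a + b * x.

    Requires at least two distinct x values, otherwise sxx is zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


@dataclass
class Stage:
    """One stage of a hypothetical job (all fields are illustrative)."""
    name: str
    # White-box part: job input rows -> input rows of this stage,
    # written down from prior knowledge about the application.
    cardinality: Callable[[float], float]
    # Black-box part: monitored samples (input_rows, cores, runtime_seconds).
    samples: List[Tuple[float, int, float]]

    def predict(self, job_input_rows: float, cores: int) -> float:
        # Assumed model form: runtime grows linearly with rows per core.
        xs = [rows / c for rows, c, _ in self.samples]
        ys = [runtime for _, _, runtime in self.samples]
        a, b = fit_line(xs, ys)
        return a + b * self.cardinality(job_input_rows) / cores


def predict_job_runtime(stages: List[Stage], job_input_rows: float, cores: int) -> float:
    """Sum the per-stage predictions (assumes stages run one after another)."""
    return sum(s.predict(job_input_rows, cores) for s in stages)


# Synthetic example: a scan followed by an aggregation over filtered data.
# The selectivity 0.1 and every measurement below are invented.
example_stages = [
    Stage("scan", lambda n: n,
          [(1_000_000, 1, 3.0), (2_000_000, 1, 4.0), (4_000_000, 2, 4.0)]),
    Stage("aggregate", lambda n: 0.1 * n,
          [(100_000, 1, 1.5), (200_000, 1, 2.0), (400_000, 4, 1.5)]),
]

total = predict_job_runtime(example_stages, job_input_rows=8_000_000, cores=4)
print(f"predicted runtime: {total:.2f} s")
```

For these synthetic samples the prediction comes out at roughly 6 seconds for 8 million input rows on 4 cores. A real model would likely need more features (memory, partition count, shuffle sizes) and a richer regression than a single line per stage.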

Funder

Deutsches Zentrum für Luft- und Raumfahrt

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management, Hardware and Architecture, Information Systems, Software


Cited by 9 articles.

1. Agile-Ant: Self-Managing Distributed Cache Management for Cost Optimization of Big Data Applications;Proceedings of the VLDB Endowment;2024-07

2. TimeLink: enabling dynamic runtime prediction for Flink iterative jobs;The Journal of Supercomputing;2024-04-13

3. A Novel Multi-Task Performance Prediction Model for Spark;Applied Sciences;2023-11-11

4. KORDI: A Framework for Real-Time Performance and Cost Optimization of Apache Spark Streaming;2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS);2023-04

5. Juggler: Autonomous Cost Optimization and Performance Prediction of Big Data Applications;Proceedings of the 2022 International Conference on Management of Data;2022-06-10
