Using formal verification to evaluate the execution time of Spark applications

Author:

Baresi L.1, Bersani M. M.1, Marconi F.1, Quattrocchi G.1, Rossi M.2

Affiliation:

1. Dipartimento di Elettronica Informazione e Bioingegneria, Politecnico di Milano, Via Golgi 42, 20133, Milan, Italy

2. Dipartimento di Meccanica, Politecnico di Milano, via La Masa 1, 20156, Milano, Italy

Abstract

Apache Spark is probably the most widely adopted framework for developing big-data batch applications and for executing them on a cluster of (virtual) machines. In general, the more resources (machines) one uses, the faster applications execute, but there is currently no adequate means to determine the proper size of a Spark cluster given time constraints, or to foresee execution times given the number of employed machines. One can only run these applications and rely on experience to size the cluster and predict expected execution times. Wrong estimation of execution times can lead to costly overruns and overly long executions, thus calling for analytic sizing/prediction techniques that provide precise time guarantees. This paper addresses this problem by proposing a solution based on model checking. The approach exploits a directed acyclic graph (DAG) to abstract the structure of the execution flows of Spark programs, annotates each node (Spark stage) with execution-related data, and formulates the identification of the global execution time as a reachability problem. To avoid the well-known state-space explosion problem, the paper also proposes a technique to reduce the size of generated abstract models. This results in a significant decrease in used memory and/or verification time, making our approach feasible for predicting the execution time of Spark applications given the resources available. The benefits of the proposed reduction technique are evaluated by using both timed automata and constraint LTL over clocks logic to formally encode and analyze generated models. The approach is also successfully validated on some realistic case studies. Since the optimization is not Spark-specific, we claim that it can be applied to a wide range of applications whose underlying model can be abstracted as a DAG.

Funder

Horizon 2020

Publisher

Association for Computing Machinery (ACM)

Subject

Theoretical Computer Science, Software

Cited by 5 articles.

1. Tuning parameters of Apache Spark with Gauss–Pareto-based multi-objective optimization;Knowledge and Information Systems;2023-12-13

2. DAG-Based Formal Modeling of Spark Applications with MSVL;Information;2023-12-12

3. Detecting Data Anomalies from Their Formal Specifications: A Case Study in IoT Systems;Electronics;2023-01-27

4. Energy big data automatic desensitization model based on Spark parallel computing framework;2021 2nd International Conference on Big Data Economy and Information Management (BDEIM);2021-12

5. Formalizing Spark Applications with MSVL;Structured Object-Oriented Formal Language and Method;2021
