Affiliation:
1. Laboratoire LIP, ENS Lyon, Lyon Cedex 07, France
2. University of Kansas, KS, USA
Abstract
This article revisits checkpointing strategies when workflows composed of multiple tasks execute on a parallel platform. The objective is to minimize the expectation of the total execution time. For a single task, the Young/Daly formula provides the optimal checkpointing period. However, when many tasks execute simultaneously, the risk that one of them is severely delayed increases with the number of tasks. To mitigate this risk, a possibility is to checkpoint each task more often than with the Young/Daly strategy. But is it worth slowing each task down with extra checkpoints? Does the extra checkpointing make a difference globally? This article answers these questions. On the theoretical side, we prove several negative results for keeping the Young/Daly period when many tasks execute concurrently, and we design novel checkpointing strategies that guarantee an efficient execution with high probability. On the practical side, we report comprehensive experiments that demonstrate the need to go beyond the Young/Daly period and to checkpoint more often for a wide range of application/platform settings.
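As a rough illustration of the tension the abstract describes, the sketch below computes the classical first-order Young/Daly period, sqrt(2 x MTBF x C), and then estimates how the chance that at least one task is hit by a failure within one such period grows with the number of concurrent tasks. The function names, the numeric values, and the assumption of independent, exponentially distributed failures with the same MTBF for every task are illustrative choices made here; they are not taken from the article or its simulator.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order Young/Daly checkpointing period: sqrt(2 * MTBF * C)."""
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def prob_any_task_hit(n_tasks: int, window: float, mtbf: float) -> float:
    """Probability that at least one of n_tasks independent tasks fails
    within `window`, assuming exponential failures with the same MTBF
    per task (an assumption made for this sketch only)."""
    if n_tasks < 0 or window < 0 or mtbf <= 0:
        raise ValueError("invalid arguments")
    return 1.0 - math.exp(-n_tasks * window / mtbf)


if __name__ == "__main__":
    mtbf = 86_400.0         # hypothetical: one failure per day per task (seconds)
    checkpoint_cost = 60.0  # hypothetical: 60 s to write one checkpoint
    period = young_daly_period(mtbf, checkpoint_cost)
    print(f"Young/Daly period: {period:.1f} s")
    for n in (1, 10, 100):
        p = prob_any_task_hit(n, period, mtbf)
        print(f"{n:4d} tasks: P(at least one failure in a period) = {p:.3f}")
```

Under these assumptions, the per-period failure probability is small for a single task but approaches 1 as the number of concurrent tasks grows, which is the intuition behind checkpointing more often than the Young/Daly period suggests.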
Publisher
Association for Computing Machinery (ACM)
Subject
Computational Theory and Mathematics, Computer Science Applications, Hardware and Architecture, Modeling and Simulation, Software
Cited by
2 articles.