Checkpointing Workflows à la Young/Daly Is Not Good Enough

Published: 2022-12-16 Issue: 4 Volume: 9 Page: 1-25
ISSN: 2329-4949
Container-title: ACM Transactions on Parallel Computing
Language: en
Short-container-title: ACM Trans. Parallel Comput.

Author:

Benoit Anne¹, Perotin Lucas¹, Robert Yves¹, Sun Hongyang²

Affiliation:

1. Laboratoire LIP, ENS Lyon, Lyon Cedex 07, France

2. University of Kansas, KS, USA

Abstract

This article revisits checkpointing strategies when workflows composed of multiple tasks execute on a parallel platform. The objective is to minimize the expectation of the total execution time. For a single task, the Young/Daly formula provides the optimal checkpointing period. However, when many tasks execute simultaneously, the risk that one of them is severely delayed increases with the number of tasks. To mitigate this risk, a possibility is to checkpoint each task more often than with the Young/Daly strategy. But is it worth slowing each task down with extra checkpoints? Does the extra checkpointing make a difference globally? This article answers these questions. On the theoretical side, we prove several negative results for keeping the Young/Daly period when many tasks execute concurrently, and we design novel checkpointing strategies that guarantee an efficient execution with high probability. On the practical side, we report comprehensive experiments that demonstrate the need to go beyond the Young/Daly period and to checkpoint more often for a wide range of application/platform settings.
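The abstract refers to the Young/Daly formula without stating it. As a minimal sketch, the classic first-order form of that period is sqrt(2 × MTBF × C), where C is the cost of one checkpoint. The second function illustrates why the risk grows with the number of concurrent tasks, under the assumption (not taken from the abstract) that each task sees independent, exponentially distributed failures. The function names and parameters are illustrative only, and this is not the authors' simulator or their proposed strategies.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """Classic first-order Young/Daly checkpointing period.

    mtbf and checkpoint_cost must use the same time unit.
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def prob_any_task_hit(n_tasks: int, mtbf: float, duration: float) -> float:
    """Probability that at least one of n_tasks concurrent tasks fails
    within `duration`.

    Assumption for illustration: failures strike each task independently
    and are exponentially distributed with rate 1 / mtbf.
    """
    if n_tasks < 0 or mtbf <= 0 or duration < 0:
        raise ValueError("invalid arguments")
    return 1.0 - math.exp(-n_tasks * duration / mtbf)


if __name__ == "__main__":
    period = young_daly_period(mtbf=50.0, checkpoint_cost=1.0)
    print(f"Young/Daly period: {period:.2f}")
    for n in (1, 10, 100):
        p = prob_any_task_hit(n, mtbf=50.0, duration=period)
        print(f"{n:>3} tasks: P(at least one failure in a period) = {p:.3f}")
```

With an MTBF of 50 time units and a checkpoint cost of 1, the period is 10 time units. Under the independence assumption, the chance that some task is hit during one such period rises quickly with the number of tasks, which is the intuition behind checkpointing more often than the single-task period suggests.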

Publisher

Association for Computing Machinery (ACM)

Subject

Computational Theory and Mathematics, Computer Science Applications, Hardware and Architecture, Modeling and Simulation, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3548607

References: 38 articles.

1. Anne Benoit, Lucas Perotin, Yves Robert, and Hongyang Sun. 2021. Checkpointing Workflows à la Young/Daly Is Not Good Enough: Code for In-house Simulator. (June 2021). https://graal.ens-lyon.fr/yrobert/simulator.zip.

2. Argonne Leadership Computing Facility (ALCF). Mira Log Traces. Retrieved from https://reports.alcf.anl.gov/data/mira.html.

3. Scientific workflows: Past, present and future

4. Scheduling computational workflows on failure-prone platforms; Aupy Guillaume; Int. J. Netw. Comput., 2016

5. Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Cited by 2 articles.

1. A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?; Future Generation Computer Systems; 2024-12

2. A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance Computing; The International Journal of High Performance Computing Applications; 2023-04-05