Checkpointing strategies to tolerate non-memoryless failures on HPC platforms

Published: 2023-09-22
ISSN:2329-4949
Container-title:ACM Transactions on Parallel Computing
language:en
Short-container-title:ACM Trans. Parallel Comput.

Author:

Benoit Anne¹, Perotin Lucas¹, Robert Yves¹, Vivien Frédéric¹

Affiliation:

1. Laboratoire LIP, ENS Lyon & Inria Lyon, France

Abstract

This paper studies checkpointing strategies for parallel applications subject to failures. The optimal strategy to minimize total execution time, or makespan, is well known when failure inter-arrival times (IATs) obey an Exponential distribution, but it is unknown for non-memoryless failure distributions. We explain why the latter fact is misunderstood in recent literature. We propose a general strategy that maximizes the expected efficiency until the next failure, and we show that this strategy achieves an asymptotically optimal makespan, thereby establishing the first optimality result for arbitrary failure distributions. Through extensive simulations, we show that the new strategy is always at least as good as the Young/Daly strategy for various failure distributions. For distributions with a high infant mortality (such as LogNormal with shape parameter k = 2.51 or Weibull with shape parameter 0.5), the execution time is divided by a factor of 1.9 on average, and by up to a factor of 4.2 for recently deployed platforms.
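For orientation, the baseline named in the abstract can be sketched in a few lines. The snippet below is not the paper's new strategy: it only computes the classical first-order Young/Daly period, sqrt(2 × MTBF × C), and evaluates the Weibull hazard rate to illustrate what "infant mortality" means (a failure rate that decreases with time when the shape parameter is below 1). The function names, parameter names, and numeric values are illustrative assumptions and do not come from the paper.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order Young/Daly checkpointing period: sqrt(2 * MTBF * C).

    mtbf and checkpoint_cost must be positive and in the same time unit.
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1) for t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1.0)


if __name__ == "__main__":
    # Hypothetical platform: MTBF of 20,000 s, checkpoint cost of 100 s.
    print(young_daly_period(20_000.0, 100.0))

    # Shape 0.5 (as in the abstract): the hazard rate decreases over time.
    for t in (1.0, 10.0, 100.0):
        print(t, weibull_hazard(t, shape=0.5, scale=1.0))
```

With shape 1 the Weibull distribution reduces to the Exponential case, where the hazard rate is constant; this memoryless setting is the one in which a fixed period is known to be optimal. With shape 0.5 the hazard rate is highest right after a restart, which is why a fixed period can be a poor fit.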

Publisher

Association for Computing Machinery (ACM)

Subject

Computational Theory and Mathematics, Computer Science Applications, Hardware and Architecture, Modeling and Simulation, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3624560

Reference: 47 articles.

1. Checkpointing Strategies for Scheduling Computational Workflows

2. Guillaume Aupy, Yves Robert, and Frédéric Vivien. 2017. Assuming failure independence: are we right to be wrong?. In FTS’2017.

3. L. Bautista-Gomez, A. Gainaru, S. Perarnau, D. Tiwari, S. Gupta, C. Engelmann, F. Cappello, and M. Snir. 2016. Reducing Waste in Extreme Scale Systems through Introspective Analysis. In IPDPS. IEEE, 212–221.

4. FTI

5. Towards Optimal Multi-Level Checkpointing