A SURVEY OF CHECKPOINT/RESTART TECHNIQUES ON DISTRIBUTED MEMORY SYSTEMS

Published:2013-12
Issue:04
Volume:23
Page:1340011
ISSN:0129-6264
Container-title:Parallel Processing Letters
language:en
Short-container-title:Parallel Process. Lett.

Author:

SHAHZAD FAISAL¹, WITTMANN MARKUS¹, KREUTZER MORITZ¹, ZEISER THOMAS¹, HAGER GEORG¹, WELLEIN GERHARD¹

Affiliation:

1. Erlangen Regional Computing Center, University of Erlangen-Nuremberg, 91058 Erlangen, Germany

Abstract

The road to exascale computing poses many challenges for the High Performance Computing (HPC) community. Each step on the exascale path is mainly the result of a higher level of parallelism of the basic building blocks (i.e., CPUs, memory units, networking components, etc.). The reliability of each of these basic components does not increase at the same rate as the rate of hardware parallelism. This results in a reduction of the mean time to failure (MTTF) of the whole system. A fault tolerance environment is thus indispensable to run large applications on such clusters. Checkpoint/Restart (C/R) is the classic and most popular method to minimize failure damage. Its ease of implementation makes it useful, but typically it introduces significant overhead to the application. Several efforts have been made to reduce the C/R overhead. In this paper we compare various C/R techniques for their overheads by implementing them on two different categories of applications. These approaches are based on parallel-file-system (PFS)-level checkpoints (synchronous/asynchronous) and node-level checkpoints. We utilize the Scalable Checkpoint/Restart (SCR) library for the comparison of node-level checkpoints. For asynchronous PFS-level checkpoints, we use the Damaris library, the SCR asynchronous feature, and application-based checkpointing via dedicated threads. Our baseline for overhead comparison is the naïve application-based synchronous PFS-level checkpointing method. A 3D lattice-Boltzmann (LBM) flow solver and a Lanczos eigenvalue solver are used as prototypical applications in which all the techniques considered here may be applied.

Publisher

World Scientific Pub Co Pte Lt

Subject

Hardware and Architecture, Theoretical Computer Science, Software

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0129626413400112

Cited by 14 articles.

1. On the Performance of Malleable APGAS Programs and Batch Job Schedulers;SN Computer Science;2024-03-27

2. Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters;SN Computer Science;2024-03-13

3. Task-Level Checkpointing for Nested Fork-Join Programs Using Work Stealing;Lecture Notes in Computer Science;2024

4. Malleable APGAS Programs and Their Support in Batch Job Schedulers;Lecture Notes in Computer Science;2024

5. Task-Level Resilience: Checkpointing vs. Supervision;International Journal of Networking and Computing;2022