Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++

Published: 2006-04 Issue: 2 Volume: 40 Pages: 90-99
ISSN: 0163-5980
Container title: ACM SIGOPS Operating Systems Review
Language: en
Short container title: SIGOPS Oper. Syst. Rev.

Author:

Zheng Gengbin¹, Huang Chao¹, Kalé Laxmikant V.¹

Affiliation:

1. University of Illinois at Urbana-Champaign

Abstract

As the size of high-performance clusters multiplies, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. Checkpoint-based fault tolerance methods are an effective approach to dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a fault occurs, the application is restarted from a recent checkpoint. However, the application developer is required to write significant additional code for checkpointing and restarting. This paper describes disk-based and memory-based checkpointing fault tolerance schemes that automate the task of checkpointing and restarting. The schemes also allow the program to be restarted on a different number of processors. These schemes are based on self-checkpointable, migratable objects supported by the Adaptive MPI (AMPI) and Charm++ run-time and can be applied to a wide class of applications written using MPI or message-driven languages. We demonstrate the effectiveness of the strategies and evaluate their performance.
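
The idea in the abstract can be illustrated with a toy sketch: objects that can serialize their own state are checkpointed, and a restart rebuilds them on a different number of processors. This is not the AMPI or Charm++ API. The names Worker, place, checkpoint and restart are hypothetical, the "processors" are plain Python lists, and pickle stands in for reliable storage.

```python
import pickle


class Worker:
    """Hypothetical migratable object: all of its state lives in plain attributes."""

    def __init__(self, obj_id, value=0):
        self.obj_id = obj_id
        self.value = value

    def step(self):
        # Stand-in for one unit of real computation.
        self.value += self.obj_id + 1


def place(objects, num_procs):
    """Map objects to simulated processors round-robin by object id."""
    procs = [[] for _ in range(num_procs)]
    for obj in sorted(objects, key=lambda o: o.obj_id):
        procs[obj.obj_id % num_procs].append(obj)
    return procs


def checkpoint(procs):
    """Serialize every object; the returned bytes stand in for reliable storage."""
    return pickle.dumps([obj.__dict__ for proc in procs for obj in proc])


def restart(blob, num_procs):
    """Rebuild objects from a checkpoint and place them on num_procs processors."""
    objects = [Worker(**state) for state in pickle.loads(blob)]
    return place(objects, num_procs)


# Run one step on 4 simulated processors, then checkpoint.
procs = place([Worker(i) for i in range(8)], num_procs=4)
for proc in procs:
    for obj in proc:
        obj.step()
saved = checkpoint(procs)

# Work done after the checkpoint is what a fault would lose.
for proc in procs:
    for obj in proc:
        obj.step()

# Restart from the checkpoint on 3 processors instead of 4.
recovered = restart(saved, num_procs=3)
```

Because the checkpoint stores per-object state rather than per-processor state, the restart is free to choose a new object-to-processor mapping, which is what makes restarting on a different processor count possible in this sketch.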

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/1131322.1131340

Cited by 9 articles.

1. The Template Task Graph (TTG) - an emerging practical dataflow programming paradigm for scientific simulation at extreme scale;2020 IEEE/ACM Fifth International Workshop on Extreme Scale Programming Models and Middleware (ESPM2);2020-11

2. Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance;Lecture Notes in Computer Science;2020

3. Checkpoint/restart approaches for a thread-based MPI runtime;Parallel Computing;2019-07

4. Improving resilience of scientific software through a domain-specific approach;Journal of Parallel and Distributed Computing;2019-06

5. Transparent High-Speed Network Checkpoint/Restart in MPI;Proceedings of the 25th European MPI Users' Group Meeting;2018-09-23