Affiliation:
1. Cornell University, Ithaca, NY
Abstract
The running times of many computational science applications, such as protein-folding using
ab initio
methods, are much longer than the mean-time-to-failure of high-performance computing platforms. To run to completion, therefore, these applications must tolerate hardware failures.In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach.We then present a suitable protocol, which is implemented by a co-ordination layer that sits between the application program and the MPI library. We show how this protocol can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design,Software
Cited by
43 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献