Affiliation:
1. Lawrence Livermore National Laboratory, Livermore, CA, USA
2. Los Alamos National Laboratory, NM, USA
Abstract
The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work has presented performance evaluations of ULFM, yet questions about its programmability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experience using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application, to shed light on the advantages and difficulties of using this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master–worker applications, it provides few benefits for the more common bulk synchronous MPI applications. To address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs, with better applicability and support than ULFM for application-level recovery mechanisms such as global rollback.
Subject
Hardware and Architecture, Theoretical Computer Science, Software
Cited by
31 articles.
1. DeLIA: A Dependability Library for Iterative Applications applied to parallel geophysical problems;Computers & Geosciences;2024-09
2. Extending the Legio Resilience Framework to Handle Critical Process Failures in MPI;2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP);2024-03-20
3. An overview of the Legio fault resilience framework for MPI applications;Procedia Computer Science;2024
4. Implementation-Oblivious Transparent Checkpoint-Restart for MPI;Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis;2023-11-12
5. Fault-tolerance at your Finger Tips with the TeamPlay Coordination Language;The 35th Symposium on Implementation and Application of Functional Languages;2023-08-29