Affiliation:
1. INRIA/LRI, Université Paris-Sud, Orsay, France
Abstract
High performance computing platforms such as clusters, grids, and desktop grids are becoming larger and subject to more frequent failures. MPI is one of the most widely used message-passing libraries in HPC applications. These two trends raise the need for fault-tolerant MPI. The MPICH-V project focuses on designing, implementing, and comparing several automatic fault-tolerant protocols for MPI applications. We present an extensive related work section highlighting the originality of our approach and the proposed protocols. We then present four fault-tolerant protocols implemented in a new generic framework for fault-tolerant protocol comparison, covering a large spectrum of known approaches, from coordinated checkpointing to uncoordinated checkpointing combined with causal message logging. We measure the performance of these protocols on a micro-benchmark and compare them on the NAS benchmark, using an original fault tolerance test. Finally, we outline the lessons learned from this in-depth comparison of fault-tolerant protocols for MPI applications.
Subject
Hardware and Architecture, Theoretical Computer Science, Software
Cited by
66 articles.
1. Extending the Legio Resilience Framework to Handle Critical Process Failures in MPI;2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP);2024-03-20
2. An overview of the Legio fault resilience framework for MPI applications;Procedia Computer Science;2024
3. Elastic deep learning through resilient collective operations;Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis;2023-11-12
4. Implementation-Oblivious Transparent Checkpoint-Restart for MPI;Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis;2023-11-12
5. Exploit Approximation to Support Fault Resiliency in MPI-based Applications;2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W);2023-06