Abstract
As HPC machines grow in size, faults become more frequent, and handling them is becoming mandatory. Natively, MPI cannot handle faults and terminates the execution prematurely when one occurs. With the introduction of ULFM, it is possible to continue the execution, but doing so requires complex integration with the application. In this paper we propose Legio, a framework that introduces fault resiliency in embarrassingly parallel MPI applications. Legio exposes its features to the application transparently, removing any integration difficulty. After a fault, the execution continues with only the non-failed processes. We also propose a hierarchical alternative, which features lower repair costs on large communicators. We evaluated our solutions on the Marconi100 cluster at CINECA with benchmarks and real-world applications, showing that the overhead introduced by the library is negligible and that it does not limit the scalability properties of MPI.
Publisher
Springer Science and Business Media LLC
Subject
Hardware and Architecture, Information Systems, Theoretical Computer Science, Software
Cited by
8 articles.
1. Extending the Legio Resilience Framework to Handle Critical Process Failures in MPI;2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP);2024-03-20
2. JASS: A Tunable Checkpointing System for NVM-Based Systems;2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC);2023-12-18
3. Exploit Approximation to Support Fault Resiliency in MPI-based Applications;2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W);2023-06
4. The Legio Fault Resilience Framework;Proceedings of the 20th ACM International Conference on Computing Frontiers;2023-05-09
5. Fault Awareness in the MPI 4.0 Session Model;Proceedings of the 20th ACM International Conference on Computing Frontiers;2023-05-09