Affiliation:
1. University of Tennessee, Knoxville, Knoxville, TN, USA
Abstract
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used in scientific applications that require solving systems of linear equations, eigenvalue problems and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This paper proposes a new hybrid approach, based on Algorithm-Based Fault Tolerance (ABFT), to help matrix factorization algorithms survive fail-stop failures. We consider extreme conditions, such as the absence of any reliable component and the possibility of losing both data and checksum from a single failure. We present a generic solution for protecting the right factor, where the updates are applied, of all of the above-mentioned factorizations. For the left factor, where the panel has been applied, we propose a scalable checkpointing algorithm. This algorithm features a high degree of checkpointing parallelism and cooperatively utilizes the checksum storage left over from the right factor protection. The fault-tolerant algorithms derived from this hybrid solution are applicable to a wide range of dense matrix factorizations, with minor modifications. Theoretical analysis shows that the fault tolerance overhead decreases sharply as the number of computing units and the problem size scale up. Experimental results of LU and QR factorization on the Kraken (Cray XT5) supercomputer validate the theoretical evaluation and confirm negligible overhead, both with and without errors.
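The checksum idea behind ABFT protection of the right factor can be illustrated with a small single-process sketch. This is not the paper's distributed, block-based algorithm, and it does not cover the loss of the checksum itself or the checkpointing of the left factor. It only shows the invariant that ABFT relies on: row operations applied to a matrix extended with a row-sum column keep that column equal to the row sums, so a lost data column can be rebuilt in the middle of the factorization. All function names are hypothetical, and Gaussian elimination without pivoting is an assumption made to keep the example short.

```python
def add_row_checksum(a):
    """Return a copy of `a` with one extra column holding each row's sum."""
    return [row[:] + [sum(row)] for row in a]


def eliminate_step(m, k):
    """Apply step k of Gaussian elimination (no pivoting, assumed nonzero pivot).

    The update runs over every column to the right of the pivot, including
    the checksum column, so the row-sum relation survives the step.
    """
    for i in range(k + 1, len(m)):
        factor = m[i][k] / m[k][k]
        for j in range(k, len(m[i])):
            m[i][j] -= factor * m[k][j]
        m[i][k] = 0.0  # eliminated entry; the multiplier (left factor) is not stored here


def recover_column(m, lost):
    """Rebuild data column `lost` from the checksum column and the surviving columns."""
    n = len(m[0]) - 1  # number of data columns; index n is the checksum
    for row in m:
        row[lost] = row[n] - sum(row[j] for j in range(n) if j != lost)


def run_demo():
    a = [[4.0, 3.0, 2.0],
         [2.0, 1.0, 3.0],
         [6.0, 5.0, 2.0]]

    # Fault-free reference run.
    reference = add_row_checksum(a)
    for k in range(len(a) - 1):
        eliminate_step(reference, k)

    # Run with a simulated fail-stop loss of column 2 after the first step.
    faulty = add_row_checksum(a)
    eliminate_step(faulty, 0)
    for row in faulty:
        row[2] = float("nan")  # the data in this column is gone
    recover_column(faulty, 2)
    for k in range(1, len(a) - 1):
        eliminate_step(faulty, k)

    return faulty, reference
```

Calling `run_demo()` returns the recovered and the fault-free results, which should agree up to rounding. The multipliers that form the left factor are discarded in this sketch, which is consistent with the abstract's point that the left factor needs a separate protection mechanism.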
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software
Cited by
70 articles.
1. Rollback-Free Recovery for a High Performance Dense Linear Solver With Reduced Memory Footprint;IEEE Transactions on Parallel and Distributed Systems;2024-07
2. Extending the Legio Resilience Framework to Handle Critical Process Failures in MPI;2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP);2024-03-20
3. An overview of the Legio fault resilience framework for MPI applications;Procedia Computer Science;2024
4. Automatic Algorithm-Based Fault Tolerance (AABFT) of Stencil Computations;2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT);2023-10-21
5. The Legio Fault Resilience Framework;Proceedings of the 20th ACM International Conference on Computing Frontiers;2023-05-09