Affiliation:
1. University of California, Riverside, USA
Abstract
Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Large supercomputers are especially susceptible to soft errors because of their large number of components. Soft errors can generally be detected offline by comparing the final results of two duplicated computations, but this approach often introduces significant overhead. This paper presents Online-ABFT, a simple but efficient online technique that detects soft errors in the widely used Krylov subspace iterative methods while the program is still executing, so that a corrupted computation can be terminated soon after a soft error occurs instead of running to completion. Because it is based on a simple verification of orthogonality and residual, Online-ABFT is easy to implement and highly efficient. Experimental results demonstrate that, when this online error detection approach is used together with checkpointing, it improves the time to obtain correct results by up to several orders of magnitude over the traditional offline approach.
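The abstract names the two verifications (orthogonality and residual) but not how they are carried out. The following is a minimal sketch of what such checks could look like, assuming the conjugate gradient (CG) method as the Krylov solver. It is not the paper's implementation: the function name `cg_with_checks`, the `check_tol` threshold, the `fault_iter` fault-injection parameter, and the choice to check on every iteration are all hypothetical and used only for illustration.

```python
import math

def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def cg_with_checks(A, b, tol=1e-10, max_iter=200, check_tol=1e-8, fault_iter=None):
    """Conjugate gradient for SPD A with two illustrative online checks.

    Returns (x, iteration, status), where status is "ok", "residual",
    or "orthogonality". All thresholds are assumptions of this sketch.
    """
    x = [0.0] * len(b)
    r = list(b)              # residual for the initial guess x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    b_norm = norm(b)
    for k in range(1, max_iter + 1):
        q = matvec(A, p)
        alpha = rs_old / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        if k == fault_iter:
            x[0] += 1.0      # hypothetical soft error: corrupt one entry of x
        r = [ri - alpha * qi for ri, qi in zip(r, q)]

        # Residual check: the recursively updated r should match b - A x.
        true_r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        drift = norm([t - ri for t, ri in zip(true_r, r)]) / b_norm
        if drift > check_tol:
            return x, k, "residual"

        rs_new = dot(r, r)
        if math.sqrt(rs_new) <= tol * b_norm:
            return x, k, "ok"   # converged and the residual check passed

        beta = rs_new / rs_old
        p_next = [ri + beta * pi for ri, pi in zip(r, p)]

        # Orthogonality check: successive directions should be A-conjugate,
        # i.e. p_next^T (A p) should be zero up to rounding error.
        cosine = abs(dot(p_next, q)) / (norm(p_next) * norm(q))
        if cosine > check_tol:
            return x, k, "orthogonality"

        p, rs_old = p_next, rs_new
    return x, max_iter, "ok"

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_clean, it_clean, status_clean = cg_with_checks(A, b)
x_bad, it_bad, status_bad = cg_with_checks(A, b, fault_iter=2)
```

In this sketch a non-"ok" status is the signal to stop early; a caller combining it with checkpointing would restore the last saved state and resume rather than let the corrupted run finish. The residual check costs one extra matrix-vector product per iteration here, so a practical variant would likely run it less often.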
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software
Cited by
86 articles.