Silent error detection in numerical time-stepping schemes

Published:2014-04-25 Issue:4 Volume:29 Page:403-421
ISSN:1094-3420
Container-title:The International Journal of High Performance Computing Applications
language:en

Author:

Benson Austin R¹,², Schmit Sven¹, Schreiber Robert²

Affiliation:

1. Institute for Computational and Mathematical Engineering, Stanford University, CA, USA

2. HP Labs, Palo Alto, CA, USA

Abstract

Errors due to hardware or low-level software problems, if detected, can be fixed by various schemes, such as recomputation from a checkpoint. Silent errors are errors in application state that have escaped low-level error detection. At extreme scale, where machines can perform astronomically many operations per second, silent errors threaten the validity of computed results. We propose a new paradigm for detecting silent errors at the application level. Our central idea is to frequently compare computed values to those provided by a cheap checking computation, and to build error detectors based on the difference between the two output sequences. Numerical analysis provides us with usable checking computations for the solution of initial-value problems in ODEs and PDEs, arguably the most common problems in computational science. Here, we provide, optimize, and test methods based on Runge–Kutta and linear multistep methods for ODEs, and on implicit and explicit finite difference schemes for PDEs. We take the heat equation and Navier–Stokes equations as examples. In tests with artificially injected errors, this approach effectively detects almost all meaningful errors, without significant slowdown.
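The central idea of the abstract, comparing each computed value with a cheap checking computation and watching the difference between the two sequences, can be illustrated with a minimal sketch. This is not the authors' detector. It assumes a Heun (second-order Runge–Kutta) step as the main computation, a forward-Euler value reusing the first stage as the check, and a simple hypothetical rule that flags a step when the difference jumps by more than `factor` times the previous difference. The names `heun_step_with_check`, `integrate`, `fault_step`, `fault_size`, and `factor` are all illustrative assumptions.

```python
import math


def heun_step_with_check(f, t, y, h, corrupt=0.0):
    """One Heun (RK2) step plus an embedded forward-Euler check value.

    `corrupt` is a hypothetical additive fault injected into the second
    stage to mimic a silent error; it is zero in a fault-free step.
    """
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1) + corrupt
    y_main = y + 0.5 * h * (k1 + k2)   # second-order result
    y_check = y + h * k1               # first-order check, reuses k1
    return y_main, abs(y_main - y_check)


def integrate(f, y0, t0, t1, n, fault_step=None, fault_size=0.0, factor=10.0):
    """Integrate y' = f(t, y) over [t0, t1] in n steps.

    A step is flagged when the main/check difference exceeds `factor`
    times the previous step's difference (an assumed, simplistic rule).
    A flagged step is recomputed, assuming the fault was transient.
    Returns the final value and the list of flagged step indices.
    """
    h = (t1 - t0) / n
    t, y = t0, y0
    prev_diff = None
    flagged = []
    for i in range(n):
        corrupt = fault_size if i == fault_step else 0.0
        y_new, diff = heun_step_with_check(f, t, y, h, corrupt)
        if prev_diff is not None and diff > factor * prev_diff:
            flagged.append(i)
            # Recompute the step without the (assumed transient) fault.
            y_new, diff = heun_step_with_check(f, t, y, h)
        prev_diff = diff
        y, t = y_new, t + h
    return y, flagged


def decay(t, y):
    """Test problem y' = -y with exact solution exp(-t)."""
    return -y


if __name__ == "__main__":
    clean, clean_flags = integrate(decay, 1.0, 0.0, 1.0, 100)
    faulty, fault_flags = integrate(decay, 1.0, 0.0, 1.0, 100,
                                    fault_step=50, fault_size=1.0)
    print("clean run:", clean, "flagged steps:", clean_flags)
    print("faulty run:", faulty, "flagged steps:", fault_flags)
    print("exact:", math.exp(-1.0))
```

The check costs almost nothing here because it reuses the stage `k1` that the main step already computed. A fixed jump threshold is a crude choice: it would miss small faults and would misbehave if the previous difference were exactly zero, so a practical detector would need a more careful model of how the difference sequence evolves.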

Publisher

SAGE Publications

Subject

Hardware and Architecture, Theoretical Computer Science, Software

Link

http://journals.sagepub.com/doi/pdf/10.1177/1094342014532297

References: 20 articles.

1. Adaptive mesh refinement for hyperbolic partial differential equations

2. Soft error vulnerability of iterative linear algebra methods

3. Toward Exascale Resilience

4. Fault resilience of the algebraic multi-grid solver

5. The International Exascale Software Project roadmap

Cited by 29 articles.

1. A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?;Future Generation Computer Systems;2024-12

2. Understanding Silent Data Corruption in Processors for Mitigating its Effects;ACM Transactions on Architecture and Code Optimization;2024-09-02

3. Reproducibility, Replicability and Repeatability: A survey of reproducible research with a focus on high performance computing;Computer Science Review;2024-08

4. Response of HPC hardware to neutron radiation at the dawn of exascale;The Journal of Supercomputing;2023-03-30

5. Resiliency in numerical algorithm design for extreme scale simulations;The International Journal of High Performance Computing Applications;2021-12-10