Affiliation:
1. Virginia Tech
2. Oak Ridge National Laboratory
Abstract
This article presents Clover, a compiler-directed soft error detection and recovery scheme for lightweight soft error resilience. The compiler generates soft-error-tolerant code based on idempotent processing, without explicit checkpoints. During program execution, Clover relies on a small number of acoustic wave detectors deployed in the processor to identify soft errors by sensing the wave generated by a particle strike. To cope with detected unrecoverable errors (DUEs) caused by the sensing latency of error detection, Clover leverages a novel selective instruction duplication technique called tail-DMR (dual modular redundancy), which provides region-level error containment. Once a soft error is detected by either the sensors or tail-DMR, Clover handles it in the same way as an exception. To recover, Clover simply redirects program control to the beginning of the code region in which the error was detected. The experimental results demonstrate that the average runtime overhead is only 26%, a 75% reduction compared to that of the state-of-the-art soft error resilience technique. In addition, this article evaluates an alternative technique called tail-wait and compares it to Clover. Across different processor configurations and error detection latencies, Clover proves superior, achieving a 1.06× to 3.49× speedup over tail-wait.
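The recovery idea described in the abstract, re-executing an idempotent region from its start when an error is detected, can be illustrated with a minimal sketch. This is not Clover's implementation, which operates at the compiler and hardware level. The names `SoftErrorDetected`, `run_with_recovery`, and `make_faulty_region`, the retry limit, and the injected bit flip are all hypothetical and serve only to show why a region that never overwrites its own inputs can be restarted safely without a checkpoint.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Region = Callable[[State], None]


class SoftErrorDetected(Exception):
    """Hypothetical stand-in for a detection event (sensor or tail-DMR)."""


def run_with_recovery(regions: List[Region], state: State,
                      max_retries: int = 3) -> State:
    """Run regions in order; on a detected error, restart the current region.

    No checkpoint is taken. This is only safe because each region is assumed
    to be idempotent: it never overwrites a value it has read as an input.
    """
    for region in regions:
        for attempt in range(max_retries + 1):
            try:
                region(state)
                break  # region completed without a detected error
            except SoftErrorDetected:
                if attempt == max_retries:
                    raise  # give up after the assumed retry budget
    return state


def make_faulty_region(fail_times: int) -> Region:
    """Build an idempotent region that suffers `fail_times` simulated faults."""
    remaining = {"n": fail_times}  # fault-injection bookkeeping, not program state

    def region(state: State) -> None:
        # Reads a and b, writes only c and d, so the inputs stay intact.
        state["c"] = state["a"] + state["b"]
        if remaining["n"] > 0:
            remaining["n"] -= 1
            state["c"] ^= 0x4  # simulated bit flip in an output value
            raise SoftErrorDetected()
        state["d"] = state["c"] * 2

    return region


result = run_with_recovery([make_faulty_region(1)], {"a": 2, "b": 3})
print(result)  # {'a': 2, 'b': 3, 'c': 5, 'd': 10}
```

The corrupted value of `c` left behind by the first attempt is harmless because the retry recomputes it from the untouched inputs `a` and `b`. A region that updated `a` in place would not have this property.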
Funder
National Science Foundation
U.S. Department of Energy
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Software
Cited by
18 articles.