Affiliation:
1. University of Virginia, Charlottesville, VA
Abstract
Transient faults due to particle strikes are a key challenge in microprocessor design. Driven by exponentially increasing transistor counts, per-chip faults are a growing burden. To protect against soft errors, redundancy techniques such as redundant multithreading (RMT) are often used. However, these techniques assume that the probability that a structural fault will result in a soft error, known as the Architectural Vulnerability Factor (AVF), is 100 percent, which unnecessarily drains processor resources. Because redundancy is costly, there have been efforts to throttle RMT at runtime. To date, these methods have not incorporated an AVF model and therefore tend to be ad hoc. Unfortunately, computing the AVF of complex microprocessor structures (e.g., the ISQ) can be quite involved.
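To make the AVF concept concrete, the sketch below uses the ACE (Architecturally Correct Execution) bit-counting definition common in the AVF literature: a structure's AVF is the fraction of its bit-cycles that hold state required for correct execution, and the effective soft-error rate is the raw fault rate derated by that fraction. This is an illustrative assumption, not the characterization developed in this paper; the function names and the four-cycle trace are hypothetical.

```python
from typing import Sequence


def avf(ace_bits_per_cycle: Sequence[int], structure_bits: int) -> float:
    """AVF as the fraction of bit-cycles holding ACE (required) state.

    ace_bits_per_cycle[i] is the number of ACE bits resident in the
    structure during cycle i (a hypothetical trace format).
    """
    if structure_bits <= 0 or not ace_bits_per_cycle:
        raise ValueError("need a non-empty trace and a positive structure size")
    total_bit_cycles = structure_bits * len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / total_bit_cycles


def effective_fit(raw_fit: float, avf_value: float) -> float:
    """Soft-error rate after derating the raw fault rate by the AVF.

    Assuming AVF = 1.0 (as always-on redundancy implicitly does)
    returns the raw rate unchanged.
    """
    return raw_fit * avf_value


# Hypothetical 64-bit structure observed for four cycles.
trace = [64, 32, 0, 32]
v = avf(trace, 64)
print(v)                         # 0.5
print(effective_fit(1000.0, v))  # 500.0
```

Here only half of the structure's bit-cycles matter, so protecting it as if every fault were visible would overstate its vulnerability by a factor of two.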
To provide probabilistic guarantees about fault tolerance, we have created a rigorous characterization of AVF behavior that can be easily implemented in hardware. We experimentally demonstrate AVF variability within and across the SPEC2000 benchmarks and identify strong correlations between structural AVF values and a small set of processor metrics. Using these simple indicators as predictors, we create a proof-of-concept RMT implementation that demonstrates that AVF prediction can be used to maintain a low fault tolerance level without significant performance impact.
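The idea of using a few processor metrics to predict AVF and then throttling RMT can be sketched as below. Several assumptions apply and are not taken from the paper: the predictor is an ordinary least-squares linear fit (a stand-in, not necessarily the authors' model), the two metrics (a queue occupancy fraction and IPC) and the training data are made up, and the policy of enabling RMT only when predicted AVF exceeds a threshold is one plausible throttling rule.

```python
from typing import List, Sequence


def fit_linear(samples: Sequence[Sequence[float]], avfs: Sequence[float]) -> List[float]:
    """Least-squares weights [bias, w1, ..., wk] via the normal equations."""
    rows = [[1.0, *s] for s in samples]
    n = len(rows[0])
    # Build X^T X (a) and X^T y (b).
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * y for r, y in zip(rows, avfs)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda k: abs(a[k][col]))
        if abs(a[piv][col]) < 1e-12:
            raise ValueError("metrics are collinear; cannot fit")
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(a[i][j] * w[j] for j in range(i + 1, n))) / a[i][i]
    return w


def predict_avf(w: Sequence[float], metrics: Sequence[float]) -> float:
    """Linear AVF estimate from the current metric values."""
    return w[0] + sum(wi * m for wi, m in zip(w[1:], metrics))


def rmt_enabled(w: Sequence[float], metrics: Sequence[float], threshold: float) -> bool:
    """Hypothetical policy: run redundant threads only when predicted AVF is high."""
    return predict_avf(w, metrics) > threshold


# Hypothetical training intervals: (occupancy fraction, IPC) -> measured AVF.
samples = [(0.2, 1.0), (0.5, 2.0), (0.8, 0.5), (0.4, 1.5)]
avfs = [0.17, 0.30, 0.555, 0.265]
w = fit_linear(samples, avfs)
print(rmt_enabled(w, (0.9, 0.5), threshold=0.3))  # True
print(rmt_enabled(w, (0.1, 2.0), threshold=0.3))  # False
```

A linear model keeps the per-interval prediction to a handful of multiply-adds, which is in the spirit of indicators that "can be easily implemented in hardware"; the threshold sets the trade-off between redundancy overhead and residual vulnerability.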
Publisher
Association for Computing Machinery (ACM)
Cited by 41 articles.
1. Gem5-MARVEL: Microarchitecture-Level Resilience Analysis of Heterogeneous SoC Architectures;2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA);2024-03-02
2. Silent Data Corruptions: Microarchitectural Perspectives;IEEE Transactions on Computers;2023-11
3. AVGI: Microarchitecture-Driven, Fast and Accurate Vulnerability Assessment;2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA);2023-02
4. Reliability-Aware Runahead;2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA);2022-04
5. Gem5Panalyzer: A Light-weight tool for Early-stage Architectural Reliability Evaluation & Prediction;2020 IEEE 63rd International Midwest Symposium on Circuits and Systems (MWSCAS);2020-08