Transient fault detection via simultaneous multithreading

Published: 2000-05 Issue: 2 Volume: 28 Page: 25-36
ISSN: 0163-5964
Container-title: ACM SIGARCH Computer Architecture News
Language: en
Short-container-title: SIGARCH Comput. Archit. News

Author:

Reinhardt Steven K.¹, Mukherjee Shubhendu S.²

Affiliation:

1. EECS Department, University of Michigan, Ann Arbor, 1301 Beal Avenue, Ann Arbor, MI

2. VSSAD, Alpha Technology Group, Compaq Computer Corporation, 334 South Street, Mail Stop SHR3-2E/R28, Shrewsbury, MA

Abstract

Smaller feature sizes, reduced voltage levels, higher transistor counts, and reduced noise margins make future generations of microprocessors increasingly prone to transient hardware faults. Most commercial fault-tolerant computers use fully replicated hardware components to detect microprocessor faults. The components are lockstepped (cycle-by-cycle synchronized) to ensure that, in each cycle, they perform the same operation on the same inputs, producing the same outputs in the absence of faults. Unfortunately, for a given hardware budget, full replication reduces performance by statically partitioning resources among redundant operations. We demonstrate that a Simultaneous and Redundantly Threaded (SRT) processor—derived from a Simultaneous Multithreaded (SMT) processor—provides transient fault coverage with significantly higher performance. An SRT processor provides transient fault coverage by running identical copies of the same program simultaneously as independent threads. An SRT processor provides higher performance because it dynamically schedules its hardware resources among the redundant copies. However, dynamic scheduling makes it difficult to implement lockstepping, because corresponding instructions from redundant threads may not execute in the same cycle or in the same order. This paper makes four contributions to the design of SRT processors. First, we introduce the concept of the sphere of replication, which abstracts both the physical redundancy of a lockstepped system and the logical redundancy of an SRT processor. This framework aids in identifying the scope of fault coverage and the input and output values requiring special handling. Second, we identify two viable spheres of replication in an SRT processor, and show that one of them provides fault detection while checking only committed stores and uncached loads. 
Third, we identify the need for consistent replication of load values, and propose and evaluate two new mechanisms for satisfying this requirement. Finally, we propose and evaluate two mechanisms—slack fetch and branch outcome queue—that enhance the performance of an SRT processor by allowing one thread to prefetch cache misses and branch results for the other thread. Our results with 11 SPEC95 benchmarks show that an SRT processor can outperform an equivalently sized, on-chip, hardware-replicated solution by 16% on average, with a maximum benefit of up to 29%.

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/342001.339652

References: 20 articles.

1. Concurrent error detection using watchdog processors-a survey

Cited by 36 articles.

1. Bare-Metal Redundant Multi-Threading on Multicore SoCs Under Neutron Irradiation;IEEE Transactions on Nuclear Science;2023-08

2. SafeLS: An Open Source Implementation of a Lockstep NOEL-V RISC-V Core;2023 IEEE 29th International Symposium on On-Line Testing and Robust System Design (IOLTS);2023-07-03

3. Supervised Triple Macrosynchronized Lockstep (STMLS) Architecture for Multicore Processors;IEEE Access;2023

4. Fault-Tolerant General Purposed Processors;Built-in Fault-Tolerant Computing Paradigm for Resilient Large-Scale Chip Design;2023

5. Hybrid Lockstep Technique for Soft Error Mitigation;IEEE Transactions on Nuclear Science;2022-07