Architectural core salvaging in a multi-core processor for hard-error tolerance

Published: 2009-06-15 Issue: 3 Volume: 37 Page: 93-104
ISSN: 0163-5964
Container-title: ACM SIGARCH Computer Architecture News
Language: en
Short-container-title: SIGARCH Comput. Archit. News

Author:

Powell Michael D.¹, Biswas Arijit¹, Gupta Shantanu², Mukherjee Shubhendu S.¹

Affiliation:

1. Intel Massachusetts, Hudson, MA, USA

2. University of Michigan, Ann Arbor, MI, USA

Abstract

The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture-time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately, microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core. This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is, execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults, and can be implemented by leveraging known techniques that minimize changes to the microarchitecture. We show it is possible to optimize architectural core salvaging such that the performance on a faulty die approaches that of a fault-free die, assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.
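The migration idea in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the paper's hardware mechanism: the names `Core`, `broken_ops`, and `run_with_salvaging` are hypothetical, operations are reduced to string labels, and the thread simply stays on the core it migrated to rather than following any optimized policy.

```python
from dataclasses import dataclass


@dataclass
class Core:
    """Hypothetical model of a core with a known set of unusable op classes."""
    core_id: int
    broken_ops: frozenset = frozenset()  # op classes this core cannot execute

    def can_execute(self, op: str) -> bool:
        return op not in self.broken_ops


def run_with_salvaging(ops, cores, start=0):
    """Run `ops` in order on `cores[start]`, migrating the thread whenever
    the current core cannot execute the next op.

    Returns (trace, migrations), where trace lists (op, core_id) pairs.
    Assumption: after a migration the thread stays on the new core.
    """
    current = cores[start]
    trace = []
    migrations = 0
    for op in ops:
        if not current.can_execute(op):
            # Pick the first core able to execute the offending op.
            target = next((c for c in cores if c.can_execute(op)), None)
            if target is None:
                # No core covers this op, so the die is not ISA compliant.
                raise RuntimeError(f"no core can execute {op!r}")
            current = target
            migrations += 1
        trace.append((op, current.core_id))
    return trace, migrations


# Core 0 has a hard error affecting "fdiv"; core 1 is fault-free.
cores = [Core(0, frozenset({"fdiv"})), Core(1)]
trace, migrations = run_with_salvaging(["add", "fdiv", "add", "mul"], cores)
print(trace, migrations)
```

In this model the die as a whole remains able to run every operation as long as each operation class is executable on at least one core, which mirrors the cross-core redundancy argument in the abstract.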

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/1555815.1555769

References: 27 articles.

1. A Mechanism for Online Diagnosis of Hard Faults in Microprocessors

2. The 65nm 16MB On-Die L3 Cache for a Dual Core Multi-Threaded Xeon® Processor

Cited by 24 articles.

1. Si-Kintsugi: Towards Recovering Golden-Like Performance of Defective Many-Core Spatial Architectures for AI;56th Annual IEEE/ACM International Symposium on Microarchitecture;2023-10-28

2. Timing Error Aware Register Allocation in TS;Computer Systems Science and Engineering;2022

3. IRHT: An SDC detection and recovery architecture based on value locality of instruction binary codes;Microprocessors and Microsystems;2020-09

4. Hot sparing for lifetime-chip-performance and cost improvement in application specific SIMT processors;Design Automation for Embedded Systems;2020-06-03

5. SalvageDNN: salvaging deep neural network accelerators with permanent faults through saliency-driven fault-aware mapping;Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences;2019-12-23