Architectural core salvaging in a multi-core processor for hard-error tolerance

Author:

Powell Michael D. (1), Biswas Arijit (1), Gupta Shantanu (2), Mukherjee Shubhendu S. (1)

Affiliation:

1. Intel Massachusetts, Hudson, MA, USA

2. University of Michigan, Ann Arbor, MI, USA

Abstract

The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture-time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately, microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core. This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is, execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults, and can be implemented by leveraging known techniques that minimize changes to the microarchitecture. We show it is possible to optimize architectural core salvaging such that the performance on a faulty die approaches that of a fault-free die, assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.
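The migration idea in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the paper's hardware mechanism: the instruction classes in `ISA`, the `Core` type, and the "stay on the new core after migrating" policy are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Hypothetical instruction classes standing in for a real ISA.
ISA = frozenset({"int_alu", "fp_add", "fp_mul", "simd", "load_store"})


@dataclass
class Core:
    core_id: int
    # Instruction classes this core cannot execute because of a hard error.
    broken: frozenset = frozenset()

    def can_execute(self, op):
        return op not in self.broken


def die_is_isa_compliant(cores):
    """True if every instruction class can run on at least one core."""
    return all(any(c.can_execute(op) for c in cores) for op in ISA)


def run_thread(ops, cores, start_core=0):
    """Run ops in order, migrating when the current core cannot execute one.

    Returns (trace, migrations), where trace is a list of (op, core_id).
    Assumed policy: the thread stays on the core it migrated to.
    """
    current = cores[start_core]
    trace = []
    migrations = 0
    for op in ops:
        if not current.can_execute(op):
            # Pick the first core able to execute the offending operation.
            target = next((c for c in cores if c.can_execute(op)), None)
            if target is None:
                raise RuntimeError(f"no core can execute {op!r}")
            current = target
            migrations += 1
        trace.append((op, current.core_id))
    return trace, migrations


# Usage: core 0 has a faulty SIMD unit, core 1 is fault-free.
cores = [Core(0, frozenset({"simd"})), Core(1)]
trace, migrations = run_thread(["int_alu", "simd", "fp_add"], cores)
```

In this model the die stays ISA compliant as long as each instruction class works on at least one core, which mirrors the cross-core redundancy argument; a workload that never issues the broken class runs on the faulty core with no migrations at all.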

Publisher

Association for Computing Machinery (ACM)

Cited by 24 articles.

1. Si-Kintsugi: Towards Recovering Golden-Like Performance of Defective Many-Core Spatial Architectures for AI;56th Annual IEEE/ACM International Symposium on Microarchitecture;2023-10-28

2. Timing Error Aware Register Allocation in TS;Computer Systems Science and Engineering;2022

3. IRHT: An SDC detection and recovery architecture based on value locality of instruction binary codes;Microprocessors and Microsystems;2020-09

4. Hot sparing for lifetime-chip-performance and cost improvement in application specific SIMT processors;Design Automation for Embedded Systems;2020-06-03

5. SalvageDNN: salvaging deep neural network accelerators with permanent faults through saliency-driven fault-aware mapping;Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences;2019-12-23
