XED

Author:

Nair Prashant J. (1), Sridharan Vilas (2), Qureshi Moinuddin K. (1)

Affiliation:

1. Georgia Institute of Technology

2. RAS Architecture, Advanced Micro Devices Inc.

Abstract

Large-granularity memory failures continue to be a critical impediment to system reliability. To make matters worse, as DRAM scales to smaller nodes, the frequency of unreliable bits in DRAM chips continues to increase. To mitigate such scaling-related failures, memory vendors are planning to equip existing DRAM chips with On-Die ECC. To maintain compatibility with memory standards, On-Die ECC is kept invisible to the memory controller. This paper explores how to design high-reliability memory systems in the presence of On-Die ECC. We show that if On-Die ECC is not exposed to the memory system, a 9-chip ECC-DIMM (implementing SECDED) provides almost no reliability benefit over an 8-chip non-ECC DIMM. We also show that if the error detection of On-Die ECC can be exposed to the memory controller, then Chipkill-level reliability can be achieved even with a 9-chip ECC-DIMM. To this end, we propose eXposed On-Die Error Detection (XED), which exposes the On-Die error detection information without requiring changes to the memory standards or incurring bandwidth overheads. When the On-Die ECC detects an error, XED transmits a pre-defined "catch-word" instead of the corrected data value. On receiving the catch-word, the memory controller uses the parity stored in the 9th chip of the ECC-DIMM to correct the faulty chip (similar to RAID-3). Our studies show that XED provides Chipkill-level reliability (172x higher than SECDED) while incurring negligible overheads, with a 21% lower execution time than Chipkill. We also show that XED can enable Chipkill systems to provide Double-Chipkill-level reliability while avoiding the associated storage, performance, and power overheads.
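The catch-word mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the catch-word value, the 64-bit word size, and the function names (`write_line`, `chip_read`, `controller_read`) are assumptions chosen for illustration. The sketch models eight data chips plus a ninth chip holding the XOR parity of the other eight, and shows the controller rebuilding the word of whichever single chip sent the catch-word.

```python
from functools import reduce
from operator import xor

# Hypothetical 64-bit catch-word; the actual value is not given in the abstract.
CATCH_WORD = 0xDEADBEEFCAFEF00D


def parity(words):
    """XOR of all words (RAID-3 style parity)."""
    return reduce(xor, words, 0)


def write_line(data_words):
    """Return the 9 words stored on the DIMM: 8 data words plus 1 parity word."""
    assert len(data_words) == 8
    return list(data_words) + [parity(data_words)]


def chip_read(stored_word, on_die_error_detected):
    """A chip sends the catch-word instead of data when its on-die ECC flags an error."""
    return CATCH_WORD if on_die_error_detected else stored_word


def controller_read(received):
    """Recover the 8 data words from the 9 received words.

    A single chip that sent the catch-word is rebuilt by XOR-ing the other
    eight words. More than one flagged chip cannot be repaired by one parity word.
    """
    flagged = [i for i, w in enumerate(received) if w == CATCH_WORD]
    if not flagged:
        return list(received[:8])
    if len(flagged) > 1:
        raise ValueError("more than one chip flagged; single parity cannot repair")
    bad = flagged[0]
    others = [w for i, w in enumerate(received) if i != bad]
    repaired = list(received)
    repaired[bad] = parity(others)
    return repaired[:8]


# Usage: chip 3 detects an on-die error and sends the catch-word.
data = [i * 0x0101010101010101 for i in range(1, 9)]
stored = write_line(data)
received = [chip_read(w, i == 3) for i, w in enumerate(stored)]
recovered = controller_read(received)
```

In this simplified model, a data word that happens to equal the catch-word is still read back correctly, because rebuilding it from parity yields the same value. How the paper itself handles such collisions, or errors that on-die ECC fails to detect, is not covered by the abstract and is left out of the sketch.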

Publisher

Association for Computing Machinery (ACM)


Cited by 12 articles.

1. Revisiting row hammer: A deep dive into understanding and resolving the issue;Microelectronics Reliability;2024-09

2. Structural Coding: A Low-Cost Scheme to Protect CNNs from Large-Granularity Memory Faults;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2023-11-11

3. Unity ECC: Unified Memory Protection Against Bit and Chip Errors;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2023-11-11

4. How to Kill the Second Bird with One ECC: The Pursuit of Row Hammer Resilient DRAM;56th Annual IEEE/ACM International Symposium on Microarchitecture;2023-10-28

5. Predicting Future-System Reliability with a Component-Level DRAM Fault Model;56th Annual IEEE/ACM International Symposium on Microarchitecture;2023-10-28
