Affiliation:
1. Georgia Institute of Technology
2. RAS Architecture, Advanced Micro Devices Inc.
Abstract
Large-granularity memory failures continue to be a critical impediment to system reliability. To make matters worse, as DRAM scales to smaller nodes, the frequency of unreliable bits in DRAM chips continues to increase. To mitigate such scaling-related failures, memory vendors are planning to equip existing DRAM chips with On-Die ECC. To maintain compatibility with memory standards, On-Die ECC is kept invisible to the memory controller.
This paper explores how to design high-reliability memory systems in the presence of On-Die ECC. We show that if On-Die ECC is not exposed to the memory system, having a 9-chip ECC-DIMM (implementing SECDED) provides almost no reliability benefits compared to an 8-chip non-ECC DIMM. We also show that if the error detection of On-Die ECC can be exposed to the memory controller, then Chipkill-level reliability can be achieved even with a 9-chip ECC-DIMM. To this end, we propose eXposed On-Die Error Detection (XED), which exposes the On-Die error detection information without requiring changes to the memory standards or incurring bandwidth overheads. When the On-Die ECC detects an error, XED transmits a pre-defined "catch-word" instead of the corrected data value. On receiving the catch-word, the memory controller uses the parity stored in the 9th chip of the ECC-DIMM to correct the faulty chip (similar to RAID-3). Our studies show that XED provides Chipkill-level reliability (172x higher than SECDED), while incurring negligible overheads, with a 21% lower execution time than Chipkill. We also show that XED can enable Chipkill systems to provide Double-Chipkill level reliability while avoiding the associated storage, performance, and power overheads.
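The catch-word and parity mechanism described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the 64-bit word per chip, the `CATCH_WORD` value, and the function names `write_line`, `chip_read`, and `controller_read` are all assumptions made for illustration, and the sketch ignores the case where legitimate data happens to equal the catch-word.

```python
from functools import reduce
from operator import xor

# Hypothetical 64-bit catch-word; the actual value is not given in the abstract.
CATCH_WORD = 0xDEADBEEFCAFEF00D


def write_line(data_words):
    """Return the 9 words stored on the DIMM: 8 data words plus XOR parity."""
    if len(data_words) != 8:
        raise ValueError("expected one word per data chip (8 chips)")
    parity = reduce(xor, data_words)
    return list(data_words) + [parity]


def chip_read(stored_word, on_die_error_detected):
    """Model one chip: send the catch-word when On-Die ECC flags an error."""
    return CATCH_WORD if on_die_error_detected else stored_word


def controller_read(words_from_chips):
    """Rebuild the word of a single flagged chip from the other eight (RAID-3 style)."""
    flagged = [i for i, w in enumerate(words_from_chips) if w == CATCH_WORD]
    if not flagged:
        return list(words_from_chips[:8])
    if len(flagged) > 1:
        raise RuntimeError("more than one chip flagged; parity cannot repair this")
    bad = flagged[0]
    # XOR of all nine stored words is zero, so XOR of the other eight
    # equals the word that the flagged chip should have returned.
    recovered = reduce(xor, (w for i, w in enumerate(words_from_chips) if i != bad))
    repaired = list(words_from_chips)
    repaired[bad] = recovered
    return repaired[:8]
```

For example, storing eight words with `write_line`, reading them back with chip 3 flagged through `chip_read`, and passing the nine results to `controller_read` returns the original eight data words.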
Publisher
Association for Computing Machinery (ACM)
Cited by
12 articles.
1. Revisiting row hammer: A deep dive into understanding and resolving the issue;Microelectronics Reliability;2024-09
2. Structural Coding: A Low-Cost Scheme to Protect CNNs from Large-Granularity Memory Faults;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2023-11-11
3. Unity ECC: Unified Memory Protection Against Bit and Chip Errors;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2023-11-11
4. How to Kill the Second Bird with One ECC: The Pursuit of Row Hammer Resilient DRAM;56th Annual IEEE/ACM International Symposium on Microarchitecture;2023-10-28
5. Predicting Future-System Reliability with a Component-Level DRAM Fault Model;56th Annual IEEE/ACM International Symposium on Microarchitecture;2023-10-28