Fault Tolerance Technique Offlining Faulty Blocks by Heap Memory Management

Author:

Jun Jaeyung (1), Paik Yoonah (1), Min Gyeong Il (1), Kim Seon Wook (1), Han Youngsun (2)

Affiliation:

1. Korea University, Seoul, Korea

2. Kyungil University, Gyeongsan, Korea

Abstract

As dynamic random access memory (DRAM) cells continue to be scaled down for higher density and capacity, they become more prone to faults, and DRAM reliability becomes a major concern in computer systems. Previous studies have proposed many techniques for preserving reliability in various system components, such as DRAM internals, memory controllers, caches, and operating systems. By reviewing these techniques, we identified two considerations. First, faults can be recovered with reasonable overhead at a high fault rate only if the recovery unit is fine-grained. Second, since hardware modification adds cost to deploying a technique, a pure software-based recovery technique is preferable. However, in the existing software-based recovery technique, the recovery unit is too coarse-grained to tolerate a high fault rate. In this article, we propose a pure software-based recovery technique with fine granularity. Our key idea builds on the fact that heap segments are managed by the system library as variable-sized chunks to handle dynamic allocation in user applications. In our technique, faulty blocks in pages are offlined by marking them as allocated chunks. Thus, not only fault-free pages but also the remaining clean blocks in faulty pages remain usable. Our technique is implemented by modifying the operating system and the system library. Since the implementation requires no hardware assistance, we evaluated our method on a real machine. Our evaluation results show that our technique has negligible performance overhead at a high bit error rate (BER) of 5.12e-5, which a hardware-based recovery technique could not tolerate without unacceptable area overhead. Also, at the same BER, our method provides 5.22× the usable space of page-offline, the state-of-the-art pure software-based technique.
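The idea of offlining faulty blocks by marking them as allocated chunks can be illustrated with a toy model. The sketch below is not the authors' implementation, which modifies the operating system and the system library. It is a minimal first-fit allocator over fixed-size blocks, and every name in it (`ToyHeap`, `clean_fraction`), the 4096-byte page size, and the 64-byte block size are assumptions made for illustration. The helper `clean_fraction` additionally assumes that bit faults are independent and uniformly distributed, which the abstract does not state.

```python
PAGE_SIZE = 4096   # bytes per page (assumed; typical on x86 Linux)
BLOCK_SIZE = 64    # bytes per offlining unit (hypothetical choice)
BLOCKS_PER_PAGE = PAGE_SIZE // BLOCK_SIZE


class ToyHeap:
    """Block-granular heap model: each block is 'free', 'used', or 'faulty'."""

    def __init__(self, num_pages, faulty_blocks):
        self.state = ["free"] * (num_pages * BLOCKS_PER_PAGE)
        # Offline each faulty block by marking it as permanently allocated,
        # so the allocator below can never hand it out.
        for b in faulty_blocks:
            self.state[b] = "faulty"

    def malloc(self, nblocks):
        """First-fit search for nblocks contiguous free blocks.

        Returns the index of the first block, or None if nothing fits.
        """
        run = 0
        for i, s in enumerate(self.state):
            run = run + 1 if s == "free" else 0
            if run == nblocks:
                start = i - nblocks + 1
                self.state[start:i + 1] = ["used"] * nblocks
                return start
        return None

    def free(self, start, nblocks):
        """Release a chunk previously returned by malloc."""
        self.state[start:start + nblocks] = ["free"] * nblocks

    def usable_blocks(self):
        """Blocks that remain usable under block-level offlining."""
        return sum(s != "faulty" for s in self.state)

    def usable_blocks_page_offline(self):
        """Blocks that remain usable if every page with a fault is dropped."""
        total = 0
        for p in range(0, len(self.state), BLOCKS_PER_PAGE):
            if "faulty" not in self.state[p:p + BLOCKS_PER_PAGE]:
                total += BLOCKS_PER_PAGE
        return total


def clean_fraction(ber, unit_bytes):
    """Probability that a unit of unit_bytes contains no faulty bit,
    assuming independent bit faults at rate ber."""
    return (1.0 - ber) ** (unit_bytes * 8)


if __name__ == "__main__":
    heap = ToyHeap(num_pages=2, faulty_blocks=[3])
    print(heap.malloc(3))                     # chunk placed before the faulty block
    print(heap.malloc(4))                     # chunk placed after the faulty block
    print(heap.usable_blocks(), heap.usable_blocks_page_offline())
    print(clean_fraction(5.12e-5, PAGE_SIZE), clean_fraction(5.12e-5, BLOCK_SIZE))
```

With a single faulty block in a two-page heap, block-level offlining loses one block, whereas page-level offlining loses the whole page. Under the independence assumption, at a BER of 5.12e-5 roughly 19% of 4096-byte pages would be fault-free while roughly 97% of 64-byte blocks would be, a ratio of about 5.2. That is only a plausibility check under the assumed block size, not a reproduction of the reported 5.22× figure.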

Funder

IT R&D program of MOTIE/KEIT

Design technology development of ultra-low voltage operating circuit and IP for smart sensor SoC

Publisher

Association for Computing Machinery (ACM)

Subject

Electrical and Electronic Engineering, Computer Graphics and Computer-Aided Design, Computer Science Applications


Cited by 1 article.

1. Generating Representative Test Sequences from Real Workload for Minimizing DRAM Verification Overhead;ACM Transactions on Design Automation of Electronic Systems;2020-09-02
