Efficient Checkpointing with Recompute Scheme for Non-volatile Main Memory

Author:

Alshboul Mohammad (1), Elnawawy Hussein (1), Elkhouly Reem (2), Kimura Keiji (3), Tuck James (1), Solihin Yan (4)

Affiliation:

1. North Carolina State University, USA

2. Tanta University, Egypt and Waseda University, Japan

3. Waseda University, Japan

4. University of Central Florida, USA

Abstract

Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in the literature, their use must be revisited as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance. In this article, we propose a novel recompute-based failure safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% overhead with logging and 207% overhead with checkpointing. Furthermore, recompute only adds 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing. We also conduct experiments on real hardware, allowing us to run our workloads to completion while varying the number of threads used for computation. These experiments substantiate our simulation-based observations and provide a sensitivity study and performance comparison between the Recompute Scheme and Naive Checkpointing.
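The recompute idea described in the abstract can be illustrated with a small sketch. The Python code below is not the authors' implementation; it is a minimal simulation, under stated assumptions, of recovering a tiled matrix multiplication by recomputing unfinished tiles. The names `compute_tile`, `run`, `done`, and `crash_at` are hypothetical, and the per-tile completion marker kept in `done` is an assumed stand-in for the "enough state to enable recomputation" mentioned above. The sketch also assumes that each output tile depends only on the read-only inputs, so recomputing a tile is always safe. It does not model cache-line flushes, fences, or any real NVMM behavior.

```python
def tile_origins(n, tile):
    """List the top-left corner of every output tile of an n x n matrix."""
    return [(i0, j0) for i0 in range(0, n, tile) for j0 in range(0, n, tile)]


def compute_tile(A, B, C, i0, j0, tile):
    """Compute one output tile of C = A x B from the inputs only.

    The tile is overwritten rather than accumulated into, so running this
    function twice on the same tile gives the same result (idempotent).
    """
    n = len(A)
    for i in range(i0, min(i0 + tile, n)):
        for j in range(j0, min(j0 + tile, n)):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))


def run(A, B, C, done, tile, crash_at=None):
    """Compute every tile that has no completion marker in `done`.

    `done` is the only bookkeeping state (assumption: one small marker per
    tile, no copy of the data). `crash_at` is a hypothetical knob that
    simulates a failure while the tile with that index is half written.
    Returns True if the whole computation finished, False on a simulated crash.
    """
    for idx, (i0, j0) in enumerate(tile_origins(len(A), tile)):
        if (i0, j0) in done:
            continue  # tile completed before the failure, nothing to redo
        if idx == crash_at:
            C[i0][j0] = -1  # torn tile: partially written, marker never set
            return False
        compute_tile(A, B, C, i0, j0, tile)
        done.add((i0, j0))  # record completion only after the tile is written
    return True


# Small demonstration: crash during the third tile, then recover.
A = [[i + j for j in range(4)] for i in range(4)]
B = [[i * j + 1 for j in range(4)] for i in range(4)]
C = [[0] * 4 for _ in range(4)]
done = set()

finished = run(A, B, C, done, tile=2, crash_at=2)  # simulated failure
tiles_before_crash = len(done)
recovered = run(A, B, C, done, tile=2)  # recovery recomputes unmarked tiles
```

In this sketch the first call stops with two tiles marked complete and one tile torn; the second call skips the marked tiles and recomputes the rest, including the torn one. The point of the design is that no copy of `C` is ever saved: recovery relies on the markers plus the ability to regenerate a tile from its inputs.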

Funder

National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Cited by 8 articles.

1. Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing;ACM Transactions on Architecture and Code Optimization;2023-12-14

2. PreFlush: Lightweight Hardware Prediction Mechanism for Cache Line Flush and Writeback;2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT);2023-10-21

3. Reconciling Selective Logging and Hardware Persistent Memory Transaction;2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA);2023-02

4. Pop-Crypt: Identification and Management of Popular Words for Enhancing Lifetime of EnCrypted Nonvolatile Main Memories;IEEE Transactions on Very Large Scale Integration (VLSI) Systems;2022-09

5. Clobber-NVM: log less, re-execute more;Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems;2021-04-17
