Affiliation:
1. Department of Computer Science, University of New Mexico, Albuquerque, USA
2. Scalable System Software Department, Sandia National Laboratories, USA
Abstract
As high-performance computing systems continue to increase in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes and, thereby, reducing CR protocol overhead, leading to improved application performance. In this article, we further explore compression-based CR optimization by examining its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance, and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. Our key results are that: (1) compression is a viable CR optimization; (2) generic, text-based compression algorithms appear to perform near optimally for checkpoint data compression, and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost effective.
Subject
Hardware and Architecture, Theoretical Computer Science, Software
Cited by
9 articles.
1. PreFlush: Lightweight Hardware Prediction Mechanism for Cache Line Flush and Writeback;2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT);2023-10-21
2. NVMe-CR: A Scalable Ephemeral Storage Runtime for Checkpoint/Restart with NVMe-over-Fabrics;2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS);2021-05
3. Checkpointing;Fault-Tolerant Systems;2021
4. Compiler aided checkpointing using crash-consistent data structures in NVMM systems;Proceedings of the 34th ACM International Conference on Supercomputing;2020-06-29
5. Analyzing the Performance and Accuracy of Lossy Checkpointing on Sub-Iteration of NWChem;2019 IEEE/ACM 5th International Workshop on Data Analysis and Reduction for Big Scientific Data (DRBSD-5);2019-11