Affiliation:
1. University of Illinois, Urbana-Champaign
2. University of Illinois, Urbana-Champaign and Microsoft Research Asia
Abstract
This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all memory-based. It enables recovery from a wide class of errors, including the permanent loss of an entire node. To maintain high performance, ReVive includes specialized hardware that performs frequent operations in the background, such as log and parity updates. To keep the cost low, more complex checkpointing and recovery functions are performed in software, while the hardware modifications are limited to the directory controllers of the machine. Our simulation results on a 16-processor system indicate that the average error-free execution time overhead of using ReVive is only 6.3%, while the achieved availability is better than 99.999% even when the errors occur as often as once per day.
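The abstract names three memory-based ingredients: logging, checkpointing, and distributed parity. The following is a minimal software sketch of the general XOR-parity idea only, not ReVive's hardware design, which the abstract places in the directory controllers. It shows how a parity page can be updated incrementally on each write, how a lost node's page can be rebuilt from the survivors, and how a log of old values allows rollback. All function names, the one-page-per-node layout, and the log format are hypothetical and chosen for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def compute_parity(pages):
    """Parity page = XOR of every node's page (assumed layout: one page per node)."""
    return reduce(xor_bytes, pages)


def write_with_parity(pages, parity, node, new_data, log):
    """Log the old value, apply the write, and return the updated parity.

    new_parity = old_parity XOR old_data XOR new_data, so only the
    written page and the parity page need to be touched.
    """
    old = pages[node]
    log.append((node, old))          # hypothetical log entry: (node, old contents)
    pages[node] = new_data
    return xor_bytes(xor_bytes(parity, old), new_data)


def reconstruct(pages, parity, lost):
    """Rebuild the page of a lost node from the parity and the surviving pages."""
    survivors = [p for i, p in enumerate(pages) if i != lost]
    return reduce(xor_bytes, survivors, parity)


def rollback(pages, parity, log):
    """Undo logged writes in reverse order, keeping the parity consistent."""
    while log:
        node, old = log.pop()
        parity = xor_bytes(xor_bytes(parity, pages[node]), old)
        pages[node] = old
    return parity
```

In this sketch a write after the last checkpoint appends to the log, so recovery can first rebuild any lost page from parity and then replay the log backwards to reach the checkpointed state.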
Publisher
Association for Computing Machinery (ACM)
Cited by
21 articles.
1. Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and Reliability;ACM Computing Surveys;2024-06-28
2. A Novel Cache and Consistency Mechanism for IoT Time Series Data;2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys);2023-12-17
3. Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreading;The Journal of Supercomputing;2021-05-10
4. On Providing OS Support to Allow Transparent Use of Traditional Programming Models for Persistent Memory;ACM Journal on Emerging Technologies in Computing Systems;2020-07-14
5. Introduction;Reliable and Energy Efficient Streaming Multiprocessor Systems;2017-11-04