Abstract
Dependable systems usually rely on replication to provide resilience and availability. For long-lived systems, however, replication alone is not enough: given sufficient time, the number of faulty replicas may exceed the threshold the system tolerates. To overcome this limitation, checkpoint and recovery techniques are used to update and resume failed replicas. Checkpointing procedures periodically capture snapshots of the system state during failure-free execution, enabling recovery processes to resume from a previously stored, consistent state. Nevertheless, saving checkpoints introduces overhead, because it must be synchronized with the processing of incoming requests to prevent inconsistencies. This overhead becomes even more pronounced in high-throughput systems such as Parallel State Machine Replication, where workloads dominated by independent requests leverage multi-threaded parallelism. This work addresses the cost of checkpointing by proposing a novel approach that divides the replica's state into partitions and takes snapshots of only a few partitions at a time. Replicas continue executing requests targeted at other partitions without interruption, so incoming requests are delayed during a checkpoint only if they access a partition currently being saved. Combining this approach with Parallel State Machine Replication yields shorter snapshot durations and lower client latency during checkpointing. Additionally, the proposed approach accelerates replica recovery through collaborative state transfer, which distributes the transfer workload among replicas and allows the transfer and installation of the state being recovered to proceed in parallel.
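The partition-level checkpointing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the class `PartitionedState`, the key-to-partition mapping, and the use of one lock per partition are all assumptions made for illustration. The sketch only shows that a request is blocked during a checkpoint when it targets the partition currently being saved; it does not address request ordering, cross-partition consistency, or state transfer.

```python
import copy
import threading


class PartitionedState:
    """Hypothetical key-value replica state split into locked partitions."""

    def __init__(self, num_partitions):
        self.partitions = [{} for _ in range(num_partitions)]
        # One lock per partition (an assumption of this sketch).
        self.locks = [threading.Lock() for _ in range(num_partitions)]

    def partition_of(self, key):
        # Illustrative, deterministic mapping from a string key to a partition.
        return sum(key.encode()) % len(self.partitions)

    def execute(self, key, value):
        # A request locks only the partition it accesses.
        p = self.partition_of(key)
        with self.locks[p]:
            self.partitions[p][key] = value

    def snapshot_partition(self, p):
        # Only requests that touch partition p wait while it is copied.
        with self.locks[p]:
            return copy.deepcopy(self.partitions[p])

    def checkpoint(self):
        # Save one partition at a time instead of freezing the whole state.
        return [self.snapshot_partition(p) for p in range(len(self.partitions))]
```

With four partitions, the keys `"a"` and `"b"` map to different partitions under this mapping, so a write to `"b"` can proceed while the partition holding `"a"` is being saved.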
Publisher
Sociedade Brasileira de Computacao - SB