Abstract
Two contrasting approaches have been proposed to address the scalability problem caused by the synchronization that coordinated checkpointing requires during failure-free operation: minimizing the number of checkpointing participants and making the checkpointing process non-blocking. However, these approaches are oblivious to the underlying network and therefore may not deliver the high scalability required in very large-scale P2P-based systems. This paper proposes a non-blocking coordinated checkpointing protocol that significantly reduces checkpointing synchronization overhead by structuring the peer-to-peer network into a set of groups according to a particular criterion. In this protocol, one process in each group is designated as the representative and takes on two special roles: intra-group and inter-group checkpointing coordination. Intra-group checkpointing coordination handles the checkpointing procedure among the processes within a group, whereas inter-group checkpointing coordination is performed only among representatives. As a result, the proposed protocol may considerably reduce the number of checkpointing control messages routed over core networks compared with existing protocols.
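To make the claimed reduction in core-network traffic concrete, the following is a minimal counting sketch, not the protocol itself. Everything specific in it is an assumption introduced for illustration: the grouping criterion (`site_of`, a hypothetical locality function), the election rule (lowest identifier becomes representative), and the message model (one request plus one acknowledgement per contacted process, with only inter-group messages crossing the core network).

```python
from collections import defaultdict


def group_peers(peers, key):
    """Partition peers into groups using a caller-supplied criterion (assumed)."""
    groups = defaultdict(list)
    for peer in peers:
        groups[key(peer)].append(peer)
    return dict(groups)


def elect_representatives(groups):
    """Pick one representative per group; lowest id is a hypothetical rule."""
    return {gid: min(members) for gid, members in groups.items()}


def core_messages_flat(groups, initiator_group):
    """Flat coordination: the initiator contacts every process directly.

    Each process outside the initiator's group costs one request and one
    acknowledgement on the core network.
    """
    outside = sum(len(m) for gid, m in groups.items() if gid != initiator_group)
    return 2 * outside


def core_messages_grouped(groups, initiator_group):
    """Grouped coordination: only representatives talk across groups.

    The initiating representative exchanges a request and an acknowledgement
    with each other representative; intra-group traffic stays off the core.
    The initiator_group argument is kept only for symmetry with the flat case.
    """
    return 2 * (len(groups) - 1)


def site_of(peer_id):
    """Hypothetical locality criterion: ten consecutive ids share a site."""
    return peer_id // 10


peers = list(range(40))                  # 40 processes with ids 0..39
groups = group_peers(peers, site_of)     # 4 groups of 10 processes each
representatives = elect_representatives(groups)

flat = core_messages_flat(groups, initiator_group=0)
grouped = core_messages_grouped(groups, initiator_group=0)
print(representatives)
print(f"core messages, flat: {flat}, grouped: {grouped}")
```

Under these assumptions the core-network message count grows with the number of groups rather than the number of processes, which is the effect the abstract attributes to restricting inter-group coordination to representatives.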
Publisher
Trans Tech Publications, Ltd.