Fast failure recovery in distributed graph processing systems

Published:2014-12 Issue:4 Volume:8 Page:437-448
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Shen Yanyan¹, Chen Gang², Jagadish H. V.³, Lu Wei⁴, Ooi Beng Chin¹, Tudor Bogdan Marius¹

Affiliation:

1. National University of Singapore

2. Zhejiang University

3. University of Michigan

4. Renmin University

Abstract

Distributed graph processing systems increasingly require many compute nodes to cope with the requirements imposed by contemporary graph-based Big Data applications. However, increasing the number of compute nodes increases the chance of node failures. Therefore, provisioning an efficient failure recovery strategy is critical for distributed graph processing systems. This paper proposes a novel recovery mechanism for distributed graph processing systems that parallelizes the recovery process. The key idea is to partition the part of the graph that is lost during a failure among a subset of the remaining nodes. To do so, we augment the existing checkpoint-based and log-based recovery schemes with a partitioning mechanism that is sensitive to the total computation and communication cost of the recovery process. Our implementation on top of the widely used Giraph system outperforms checkpoint-based recovery by up to 30x on a cluster of 40 compute nodes.
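
To make the key idea concrete, here is a minimal sketch of spreading the lost graph partitions across surviving nodes while accounting for both computation and communication cost. It is not the paper's algorithm: the greedy heuristic, the function name `reassign_lost_partitions`, and the cost inputs are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def reassign_lost_partitions(
    lost: Dict[str, float],
    comm: Dict[str, Dict[str, float]],
    nodes: List[str],
) -> Dict[str, str]:
    """Greedily assign lost partitions to surviving nodes.

    Hypothetical cost model (an assumption, not taken from the paper):
      lost[pid]       -- cost of recomputing partition `pid`
      comm[pid][node] -- extra communication cost if `pid` is placed on `node`
    """
    load = {n: 0.0 for n in nodes}  # recovery cost accumulated per node
    assignment: Dict[str, str] = {}

    # Place the most expensive partitions first; ties broken by partition id.
    for pid in sorted(lost, key=lambda p: (-lost[p], p)):
        # Pick the node whose total cost after placement would be smallest;
        # ties broken by node name so the result is deterministic.
        best = min(
            nodes,
            key=lambda n: (load[n] + lost[pid] + comm[pid].get(n, 0.0), n),
        )
        assignment[pid] = best
        load[best] += lost[pid] + comm[pid].get(best, 0.0)

    return assignment


if __name__ == "__main__":
    lost = {"p1": 10.0, "p2": 8.0, "p3": 3.0}
    comm = {
        "p1": {"a": 1.0, "b": 5.0},
        "p2": {"a": 4.0, "b": 1.0},
        "p3": {"a": 2.0, "b": 2.0},
    }
    print(reassign_lost_partitions(lost, comm, ["a", "b"]))
```

With these made-up costs, the lost partitions end up on both surviving nodes rather than on a single replacement node, so the recovery work can proceed in parallel.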

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/2735496.2735506

Cited by 30 articles.

1. Swift: Expedited Failure Recovery for Large-Scale DNN Training;IEEE Transactions on Parallel and Distributed Systems;2024-09

2. An unsupervised learning-guided multi-node failure-recovery model for distributed graph processing systems;The Journal of Supercomputing;2023-01-13

3. ACF2: Accelerating Checkpoint-Free Failure Recovery for Distributed Graph Processing;Web and Big Data;2023

4. Overview of Data Synchronization and Fault Recovery Technology in Multi Active Data Center;2021 IEEE 4th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE);2021-11-19

5. A Fault-Tolerant Distributed Framework for Asynchronous Iterative Computations;IEEE Transactions on Parallel and Distributed Systems;2021-08-01