
ByteGNN

Published:2022-02 Issue:6 Volume:15 Page:1228-1242
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Zheng Chenguang¹, Chen Hongzhi², Cheng Yuxuan², Song Zhezheng³, Wu Yifan⁴, Li Changji¹, Cheng James³, Yang Hao², Zhang Shuai²

Affiliation:

1. The Chinese University of Hong Kong and ByteDance Inc

2. ByteDance Inc

3. The Chinese University of Hong Kong

4. ByteDance Inc and Peking University

Abstract

Graph neural networks (GNNs) have shown excellent performance in a wide range of applications such as recommendation, risk control, and drug discovery. With the increase in the volume of graph data, distributed GNN systems become essential to support efficient GNN training. However, existing distributed GNN training systems suffer from various performance issues including high network communication cost, low CPU utilization, and poor end-to-end performance. In this paper, we propose ByteGNN, which addresses the limitations in existing distributed GNN systems with three key designs: (1) an abstraction of mini-batch graph sampling to support high parallelism, (2) a two-level scheduling strategy to improve resource utilization and to reduce the end-to-end GNN training time, and (3) a graph partitioning algorithm tailored for GNN workloads. Our experiments show that ByteGNN outperforms the state-of-the-art distributed GNN systems with up to 3.5–23.8 times faster end-to-end execution, 2–6 times higher CPU utilization, and around half of the network communication cost.

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3514061.3514069

References: 81 articles.

1. 2019. Euler. https://github.com/alibaba/euler.

2. 2019. Gremlin. http://tinkerpop.apache.org/gremlin.html.

3. 2019. TinkerPop. http://tinkerpop.apache.org/.

4. 2020. GraphLearn. https://github.com/alibaba/graph-learn.

5. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2–4, 2016. USENIX Association, 265–283. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi

Cited by 42 articles.

1. Systems for Scalable Graph Analytics and Machine Learning: Trends and Methods;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24

2. MSPipe: Efficient Temporal GNN Training via Staleness-Aware Pipeline;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24

3. GNNDrive: Reducing Memory Contention and I/O Congestion for Disk-based GNN Training;Proceedings of the 53rd International Conference on Parallel Processing;2024-08-12

4. BGS: Accelerate GNN training on multiple GPUs;Journal of Systems Architecture;2024-08

5. GE²: A General and Efficient Knowledge Graph Embedding Learning System;Proceedings of the ACM on Management of Data;2024-05-29