Large graph convolutional network training with GPU-oriented data communication architecture

Published:2021-07 Issue:11 Volume:14 Page:2087-2100
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Min Seung Won¹,Wu Kun¹,Huang Sitao¹,Hidayetoğlu Mert¹,Xiong Jinjun²,Ebrahimi Eiman³,Chen Deming¹,Hwu Wen-mei¹

Affiliation:

1. UIUC

2. IBM T.J. Watson Research Center

3. NVIDIA

Abstract

Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems. Training a GCN requires the minibatch generator to traverse graphs and sample the sparsely located neighboring nodes to obtain their features. Since real-world graphs often exceed the capacity of GPU memory, current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features before sending them to the GPUs. This approach, however, puts tremendous pressure on host memory bandwidth and the CPU, because the CPU needs to (1) read sparse features from memory, (2) write the features back into memory in a dense format, and (3) transfer the features from memory to the GPUs. In this work, we propose a novel GPU-oriented data communication approach for GCN training, where GPU threads directly access sparse features in host memory through zero-copy accesses with little CPU involvement. By removing the CPU gathering stage, our method significantly reduces the consumption of host resources and data access latency. We further present two important techniques to achieve high host memory access efficiency by the GPU: (1) automatic data access address alignment to maximize PCIe packet efficiency, and (2) asynchronous zero-copy access and kernel execution to fully overlap data transfer with training. We incorporate our method into PyTorch and evaluate its effectiveness using several graphs with sizes up to 111 million nodes and 1.6 billion edges. In a multi-GPU training setup, our method is 65-92% faster than the conventional data transfer method, and can even match the performance of all-in-GPU-memory training for some graphs that fit in GPU memory.
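
The three CPU steps listed in the abstract, and the role of address alignment, can be illustrated with a rough back-of-the-envelope model. This is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, caches are ignored, each CPU step is assumed to touch the feature payload exactly once, and the 128-byte access granularity is an assumed value chosen only for illustration.

```python
# Simplified traffic model for one minibatch of sampled node features.
# All sizes and the 128-byte granularity are illustrative assumptions,
# not numbers taken from the paper.

def cpu_gather_host_bytes(num_nodes: int, feat_dim: int, bytes_per_elem: int = 4) -> int:
    """Host-memory bytes touched when the CPU gathers features first.

    Counts the payload three times: (1) read the sparse rows,
    (2) write them into a dense buffer, (3) read that buffer again
    for the host-to-GPU transfer.
    """
    payload = num_nodes * feat_dim * bytes_per_elem
    return 3 * payload


def zero_copy_host_bytes(num_nodes: int, feat_dim: int, bytes_per_elem: int = 4) -> int:
    """Host-memory payload bytes when GPU threads read the sparse rows directly."""
    return num_nodes * feat_dim * bytes_per_elem


def blocks_touched(row_index: int, feat_dim: int,
                   bytes_per_elem: int = 4, block_bytes: int = 128) -> int:
    """Number of aligned block_bytes-sized blocks that one feature row overlaps."""
    row_bytes = feat_dim * bytes_per_elem
    start = row_index * row_bytes          # byte offset of the row in the table
    end = start + row_bytes                # exclusive end offset
    first_block = start // block_bytes
    last_block = (end - 1) // block_bytes
    return last_block - first_block + 1


if __name__ == "__main__":
    nodes, dim = 1000, 50                  # hypothetical minibatch: 1000 rows of 50 floats
    print("CPU gather path :", cpu_gather_host_bytes(nodes, dim), "bytes")
    print("zero-copy path  :", zero_copy_host_bytes(nodes, dim), "bytes")
    # A 200-byte row fits in 2 aligned blocks, but a row that starts
    # mid-block can spill into a third one.
    print("row 0 blocks    :", blocks_touched(0, dim))
    print("row 1 blocks    :", blocks_touched(1, dim))
```

Under these assumptions the CPU gather path touches three times the payload in host memory, while the direct-access path touches it once. The last two lines show why alignment matters: a row whose start address is not block-aligned can span more blocks than its size strictly requires, which is the kind of waste an alignment step would aim to avoid.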

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3476249.3476264

Cited by 34 articles.

1. OUTRE: An OUT-of-Core De-REdundancy GNN Training Framework for Massive Graphs within A Single Machine;Proceedings of the VLDB Endowment;2024-07

2. TIGER: Training Inductive Graph Neural Network for Large-Scale Knowledge Graph Reasoning;Proceedings of the VLDB Endowment;2024-06

3. SIMPLE: Efficient Temporal Graph Neural Network Training at Scale with Dynamic Data Placement;Proceedings of the ACM on Management of Data;2024-05-29

4. Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis;IEEE Transactions on Pattern Analysis and Machine Intelligence;2024-05

5. Hector: An Efficient Programming and Compilation Framework for Implementing Relational Graph Neural Networks in GPU Architectures;Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3;2024-04-27