Affiliation:
1. Advanced Micro Devices Inc., Austin, USA
2. Computer Science and Engineering, University of North Texas, Denton, USA
3. NVIDIA, Austin, USA
4. The George Washington University, Washington, USA
Abstract
Graph Neural Networks (GNNs) are an emerging class of deep learning models designed specifically for graph-structured data. They have been employed effectively in a variety of real-world applications, including recommendation systems, drug development, and social network analysis. GNN computation includes regular neural network operations and general graph convolution operations, which take most of the total computation time. Although several recent works have been proposed to accelerate GNN computation, they are limited by heavy pre-processing, inefficient atomic operations, and unnecessary kernel launches. In this article, we design TLPGNN, a lightweight two-level parallelism paradigm for GNN computation. First, we conduct a systematic analysis of the hardware resource usage of GNN workloads to understand their characteristics in depth. Based on these observations, we then divide the GNN computation into two levels: vertex parallelism for the first level and feature parallelism for the second. Next, we employ a novel hybrid dynamic workload assignment to address the imbalanced workload distribution. Furthermore, we fuse the kernels to reduce the number of kernel launches and cache frequently accessed data in registers to avoid unnecessary memory traffic. To scale TLPGNN to multi-GPU environments, we propose an edge-aware row-wise 1-D partition method to ensure a balanced workload distribution across different GPU devices. Experimental results on various benchmark datasets demonstrate the superiority of our approach, which achieves substantial performance improvements over state-of-the-art GNN computation systems, including Deep Graph Library (DGL), GNNAdvisor, and FeatGraph, with average speedups of 6.1×, 7.7×, and 3.0×, respectively. Evaluations of multi-GPU TLPGNN also demonstrate that our solution achieves both linear scalability and a well-balanced workload distribution.
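The abstract does not spell out the partitioning algorithm, so the following is only a minimal sketch of what an edge-aware row-wise 1-D partition could look like, not the authors' implementation. It assumes the graph is stored in CSR form and that each device receives one contiguous range of rows; the function names `edge_aware_row_partition` and `edges_per_part` are hypothetical. Cut points are chosen so that each range holds roughly the same number of edges rather than the same number of vertices.

```python
from bisect import bisect_left


def edge_aware_row_partition(row_ptr, num_parts):
    """Split rows [0, n) into contiguous ranges with roughly equal edge counts.

    row_ptr is a CSR row-pointer array: row v owns the edges in
    row_ptr[v]:row_ptr[v + 1]. Returns one half-open (row_start, row_end)
    range per part. Illustrative assumption, not the paper's exact method.
    """
    num_rows = len(row_ptr) - 1
    total_edges = row_ptr[-1]
    bounds = [0]
    for p in range(1, num_parts):
        target = total_edges * p / num_parts
        # First row boundary whose cumulative edge count reaches the target.
        cut = bisect_left(row_ptr, target)
        # Keep boundaries non-decreasing and inside the valid row range.
        cut = min(max(cut, bounds[-1]), num_rows)
        bounds.append(cut)
    bounds.append(num_rows)
    return [(bounds[i], bounds[i + 1]) for i in range(num_parts)]


def edges_per_part(row_ptr, parts):
    """Count the edges owned by each (row_start, row_end) range."""
    return [row_ptr[end] - row_ptr[start] for start, end in parts]


# One hub vertex with 8 edges, followed by seven low-degree vertices.
row_ptr = [0, 8, 9, 10, 11, 12, 13, 14, 16]
parts = edge_aware_row_partition(row_ptr, 2)
print(parts, edges_per_part(row_ptr, parts))  # [(0, 1), (1, 8)] [8, 8]
```

In this toy graph, splitting by vertex count alone (rows 0 to 3 and rows 4 to 7) would give the two parts 11 and 5 edges, whereas cutting on the cumulative edge count gives 8 and 8. A single very high-degree vertex can still leave a part empty or oversized, because a row is never split in this sketch.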
Publisher
Association for Computing Machinery (ACM)