Accelerating CUDA graph algorithms at maximum warp

Published:2011-09-07 Issue:8 Volume:46 Page:267-276
ISSN:0362-1340
Container-title:ACM SIGPLAN Notices
language:en
Short-container-title:SIGPLAN Not.

Author:

Hong Sungpack¹, Kim Sang Kyun¹, Oguntebi Tayo¹, Olukotun Kunle¹

Affiliation:

1. Stanford University, Stanford, USA

Abstract

Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems, but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single-threaded CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.
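
The trade-off between workload imbalance and ALU underutilization mentioned in the abstract can be illustrated with a toy cost model. The sketch below is not the paper's implementation: it assumes a 32-lane physical warp split into equal virtual warps, one vertex per virtual warp at a time, lanes striding over that vertex's adjacency list, and all virtual warps in a physical warp advancing in lockstep. The function names `simd_steps` and `lane_utilization` and the two degree lists are hypothetical, chosen only for illustration.

```python
from math import ceil

WARP_SIZE = 32  # lanes per physical warp (assumption: NVIDIA-style SIMT hardware)


def simd_steps(degrees, virtual_warp_size):
    """Count lockstep steps needed to scan every adjacency list once.

    Toy model: each virtual warp handles one vertex, and the physical warp
    is only done with a group when its slowest virtual warp is done.
    """
    if WARP_SIZE % virtual_warp_size != 0:
        raise ValueError("virtual warp size must divide the physical warp size")
    vertices_per_warp = WARP_SIZE // virtual_warp_size
    steps = 0
    for start in range(0, len(degrees), vertices_per_warp):
        group = degrees[start:start + vertices_per_warp]
        # Lanes of one virtual warp split a vertex's neighbors between them.
        steps += max(ceil(d / virtual_warp_size) for d in group)
    return steps


def lane_utilization(degrees, virtual_warp_size):
    """Fraction of lane-steps that actually touch a neighbor."""
    steps = simd_steps(degrees, virtual_warp_size)
    return sum(degrees) / (WARP_SIZE * steps) if steps else 0.0


# Hypothetical inputs: one hub vertex among low-degree vertices, and a regular graph.
skewed = [1000] + [2] * 63
regular = [4] * 64

for size in (1, 4, 8, 16, 32):
    print(size, simd_steps(skewed, size), simd_steps(regular, size),
          round(lane_utilization(regular, size), 3))
```

Under these assumptions, a virtual warp size of 1 (one thread per vertex) needs 1002 steps on the skewed input because 31 lanes idle while the hub vertex is scanned, whereas size 32 needs 95. On the regular input the ordering reverses: size 1 needs 8 steps at full lane utilization, while size 32 needs 64 steps with only 12.5% of lanes doing useful work. Intermediate sizes sit between the two extremes, which is the tuning knob the abstract describes.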

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design, Software

Link

https://dl.acm.org/doi/pdf/10.1145/2038037.1941590

References: 24 articles.

1. Stanford large network dataset collection. http://snap.stanford.edu/data/index.html 2009.

2. http://en.wikipedia.org/wiki/GeForce_200_Series 2010.

3. Scalable Graph Exploration on Multicore Processors

4. Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2

Cited by 173 articles.

1. Parallelization of butterfly counting on hierarchical memory;The VLDB Journal;2024-06-07

2. Allok: a machine learning approach for efficient graph execution on CPU–GPU clusters;The Journal of Supercomputing;2024-05-23

3. ScalaBFS2: A High-performance BFS Accelerator on an HBM-enhanced FPGA Chip;ACM Transactions on Reconfigurable Technology and Systems;2024-04-30

4. Scaling Expected Force: Efficient Identification of Key Nodes in Network-Based Epidemic Models;2024 32nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP);2024-03-20

5. AGAThA: Fast and Efficient GPU Acceleration of Guided Sequence Alignment for Long Read Mapping;Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming;2024-02-20