Affiliation:
1. The University of Hong Kong
Abstract
General-purpose GPU computing (GPGPU) plays an increasingly vital role in high-performance computing and in areas such as deep learning. However, under the SIMD execution model, the branch divergence issue lowers the efficiency of conditional branching on GPUs and hinders the development of GPGPU. To reduce branch divergence on the spot at runtime, we propose the first on-GPU thread-data remapping scheme. Before kernel launch, our solution inserts code into GPU kernels immediately before each target branch so as to acquire actual runtime divergence information. GPU software threads can then be remapped to datasets multiple times during a single kernel execution. We propose two thread-data remapping algorithms tailored to the GPU architecture. Effective on two generations of GPUs from both NVIDIA and AMD, our solution achieves speedups of up to 2.718x on third-party benchmarks. We also implement three frontier GPGPU benchmarks from computer vision, algorithmic trading and data analytics; these are hindered by more complex divergence coupled with different memory access patterns, and our solution outperforms the traditional thread-data remapping scheme in all cases. As a compiler-assisted runtime solution, it can better reduce divergence for divergent applications that currently gain little acceleration on GPUs.
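The core idea the abstract describes, remapping threads to data so that threads in the same warp take the same side of a branch, can be sketched with a small simulation. This is an illustrative toy model, not the paper's implementation: the warp size, the identity mapping, and the sort-by-outcome remapping strategy are assumptions chosen for clarity.

```python
# Toy model of thread-data remapping for branch divergence (illustrative
# sketch only; not the paper's on-GPU algorithm). A warp diverges when
# its threads disagree on a branch condition; remapping threads to data
# elements with the same branch outcome removes that divergence.

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs (wavefronts differ on AMD)

def divergent_warps(mapping, data, cond):
    """Count warps whose threads take both sides of the branch."""
    count = 0
    for w in range(0, len(mapping), WARP_SIZE):
        outcomes = {cond(data[i]) for i in mapping[w:w + WARP_SIZE]}
        if len(outcomes) > 1:  # mixed outcomes -> serialized branch paths
            count += 1
    return count

def remap(mapping, data, cond):
    """Remap threads to data: group same-outcome elements into warps."""
    return sorted(mapping, key=lambda i: cond(data[i]))

data = list(range(256))          # toy dataset of 8 warps' worth of work
cond = lambda x: x % 2 == 0      # a branch condition every warp disagrees on
naive = list(range(len(data)))   # identity thread-to-data mapping

before = divergent_warps(naive, data, cond)
after = divergent_warps(remap(naive, data, cond), data, cond)
print(before, after)  # prints "8 0": all 8 warps diverge before, none after
```

A real on-GPU scheme must do this regrouping cheaply at runtime and account for the memory-access penalties of the new mapping, which is where the paper's two tailored algorithms come in.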
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
Cited by
11 articles.
1. Divergence Reduction in Monte Carlo Neutron Transport with On-GPU Asynchronous Scheduling;ACM Transactions on Modeling and Computer Simulation;2024-01-14
2. Development and Verification of the GPU-Based Monte Carlo Particle Transport Program MagiC;2023 5th International Academic Exchange Conference on Science and Technology Innovation (IAECST);2023-12-08
3. Optimization Techniques for GPU Programming;ACM Computing Surveys;2023-03-16
4. Scalar Replacement Considering Branch Divergence;Journal of Information Processing;2022
5. Rosella: A Self-Driving Distributed Scheduler for Heterogeneous Clusters;2021 17th International Conference on Mobility, Sensing and Networking (MSN);2021-12