Affiliation:
1. Institute of Software, Chinese Academy of Sciences, Beijing, China
Abstract
Scan (also known as prefix sum) is a useful primitive for many important parallel algorithms, such as sort, BFS, SpMV, and compaction. The current state of the art in GPU-based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N global memory accesses, where N is the problem size. In this paper we propose StreamScan, a novel approach to implementing scan on GPUs with only one computation phase. The main idea is to restrict synchronization to adjacent workgroups only, thereby eliminating global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important optimizations to further boost performance: thread grouping, which eliminates unnecessary local barriers, and register optimization, which expands the on-chip problem size. We designed an auto-tuning framework that searches the parameter space automatically to generate highly optimized code for both AMD and Nvidia GPUs. We implemented our technique in OpenCL. Compared with previous fast scan implementations, experimental results not only show promising speedups, but also reveal dramatically different optimization tradeoffs between Nvidia and AMD GPU platforms.
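The single-pass idea described in the abstract can be illustrated with a small sequential sketch. This is not the authors' OpenCL implementation: the function name `stream_scan`, the `block_size` parameter, and the access counter are hypothetical and exist only for illustration. Each loop iteration stands in for one workgroup, and the `carry` variable stands in for the value that a workgroup would receive from its immediate predecessor. On a real GPU the workgroups run concurrently and wait only on the adjacent one, which this loop models by simply running them in order.

```python
from itertools import accumulate


def stream_scan(data, block_size=4):
    """Inclusive prefix sum computed block by block in a single pass.

    Returns the scanned list and the number of simulated global
    memory accesses (one read and one write per element).
    """
    out = [0] * len(data)
    carry = 0            # running total handed over by the previous block
    reads = writes = 0

    for start in range(0, len(data), block_size):
        # Each element is read from "global memory" exactly once.
        block = data[start:start + block_size]
        reads += len(block)

        # The local scan stands in for on-chip work inside one workgroup.
        local = list(accumulate(block))

        # Each element is written back exactly once, offset by the carry.
        for i, value in enumerate(local):
            out[start + i] = carry + value
        writes += len(block)

        # Only the adjacent (next) block needs this value.
        carry += local[-1]

    return out, reads + writes
```

Under these assumptions the counter comes to 2N for an input of length N, which matches the access count stated in the abstract. A Reduce-Scan-Scan scheme would read the input a second time after the block sums are scanned, giving the 3N figure.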
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software
Cited by
44 articles.
1. Zero-Overhead Parallel Scans for Multi-Core CPUs;Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores;2024-03-03
2. X-TED: Massive Parallelization of Tree Edit Distance;Proceedings of the VLDB Endowment;2024-03
3. Performance Tuning for GPU-Embedded Systems: Machine-Learning-Based and Analytical Model-Driven Tuning Methodologies;2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD);2023-10-17
4. Optimization Techniques for GPU Programming;ACM Computing Surveys;2023-03-16
5. Prefix sum (scan);Programming Massively Parallel Processors;2023