Affiliation:
1. University of California, Davis, Davis, CA, USA
2. NVIDIA, Santa Clara, CA, USA
Abstract
We study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR) and recursive doubling (RD). We develop an approach to measure, analyze, and optimize the performance of GPU programs in terms of memory access, computation, and control overhead. We find that CR enjoys linear algorithmic complexity but suffers from more algorithmic steps and bank conflicts, while PCR and RD have fewer algorithmic steps but do more work each step. To combine the benefits of the basic algorithms, we propose hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61%, respectively. Our GPU solvers achieve up to a 28x speedup over a sequential LAPACK solver, and a 12x speedup over a multi-threaded CPU solver.
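The following sequential Python sketch makes the CR versus PCR trade-off concrete. It is not the authors' GPU implementation. Each loop marked "parallel" has independent iterations that a GPU solver would map to threads. The function names, the counters, and the `pcr_below` switch size are hypothetical illustrations.

The sketch assumes a diagonally dominant system with a[0] = 0 and c[n-1] = 0. It omits RD and every GPU concern (shared memory, bank conflicts, control overhead). The hybrid follows the CR+PCR idea: CR shrinks the system until it is small, PCR solves that smaller system, and CR then back-substitutes.

```python
# Sequential sketch of CR, PCR, and a CR+PCR hybrid for a tridiagonal system
#   a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],  with a[0] = c[n-1] = 0.
# Loops marked "parallel" have independent iterations (one GPU thread each);
# here they simply run one after another. Names and counters are hypothetical.

def thomas(a, b, c, d):
    """Sequential O(n) reference solver (Thomas algorithm), used for checking."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i], dp[i] = c[i] / m, (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def parallel_cyclic_reduction(a, b, c, d, stats):
    """PCR: every row is updated in every step, with strides 1, 2, 4, ..."""
    n = len(d)
    a, b, c, d = list(a), list(b), list(c), list(d)
    s = 1
    while s < n:
        na, nb, nc, nd = a[:], b[:], c[:], d[:]
        for i in range(n):  # parallel: all n rows update independently
            lo, hi = i - s >= 0, i + s < n
            k1 = a[i] / b[i - s] if lo else 0.0
            k2 = c[i] / b[i + s] if hi else 0.0
            na[i] = -a[i - s] * k1 if lo else 0.0
            nc[i] = -c[i + s] * k2 if hi else 0.0
            nb[i] = (b[i] - (c[i - s] * k1 if lo else 0.0)
                     - (a[i + s] * k2 if hi else 0.0))
            nd[i] = (d[i] - (d[i - s] * k1 if lo else 0.0)
                     - (d[i + s] * k2 if hi else 0.0))
        a, b, c, d = na, nb, nc, nd
        stats["steps"] += 1
        stats["updates"] += n
        s *= 2
    stats["steps"] += 1  # final parallel step: rows are decoupled, x = d / b
    stats["updates"] += n
    return [d[i] / b[i] for i in range(n)]

def cyclic_reduction(a, b, c, d, stats, pcr_below=1):
    """CR: halve the system per forward step; hand off to PCR at <= pcr_below rows."""
    n = len(d)
    if n <= max(1, pcr_below):
        return parallel_cyclic_reduction(a, b, c, d, stats)
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):  # parallel forward reduction: keep odd rows
        hi = i + 1 < n
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if hi else 0.0
        ra.append(-a[i - 1] * k1)
        rc.append(-c[i + 1] * k2 if hi else 0.0)
        rb.append(b[i] - c[i - 1] * k1 - (a[i + 1] * k2 if hi else 0.0))
        rd.append(d[i] - d[i - 1] * k1 - (d[i + 1] * k2 if hi else 0.0))
    stats["steps"] += 1
    stats["updates"] += len(rd)
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd, stats, pcr_below)
    for i in range(0, n, 2):  # parallel backward substitution: even rows
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    stats["steps"] += 1
    stats["updates"] += (n + 1) // 2
    return x

if __name__ == "__main__":
    n = 8
    a = [0.0] + [1.0] * (n - 1)
    b = [4.0] * n
    c = [1.0] * (n - 1) + [0.0]
    d = [float(i + 1) for i in range(n)]
    runs = [("CR", lambda st: cyclic_reduction(a, b, c, d, st)),
            ("PCR", lambda st: parallel_cyclic_reduction(a, b, c, d, st)),
            ("CR+PCR", lambda st: cyclic_reduction(a, b, c, d, st, pcr_below=4))]
    for name, solve in runs:
        stats = {"steps": 0, "updates": 0}
        solve(stats)
        print(f"{name}: {stats['steps']} steps, {stats['updates']} row updates")
    # CR: 7 steps, 15 row updates
    # PCR: 4 steps, 32 row updates
    # CR+PCR: 5 steps, 20 row updates
```

With this sketch's counting convention, CR does roughly 2n row updates in about 2 log2 n steps. PCR does n row updates in each of about log2 n steps. For n = 8, that gives fewer total updates for CR but fewer steps for PCR, which mirrors the abstract's observation. The hybrid lands between the two.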
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software
Cited by
89 articles.