Affiliation:
1. University of Ferrara, Italy & National Institute for Nuclear Physics, Italy
Abstract
GPUs deliver higher performance than traditional processors, offering remarkable energy efficiency, and are quickly becoming very popular processors for HPC applications. Still, writing efficient and scalable programs for GPUs is not an easy task as codes must adapt to increasingly parallel architecture features. In this chapter, the authors describe in full detail design and implementation strategies for lattice Boltzmann (LB) codes able to meet these goals. Most of the discussion uses a state-of-the art thermal lattice Boltzmann method in 2D, but all lessons learned in this particular case can be immediately extended to most LB and other scientific applications. The authors describe the structure of the code, discussing in detail several key design choices that were guided by theoretical models of performance and experimental benchmarks, having in mind both single-GPU codes and massively parallel implementations on commodity clusters of GPUs. The authors then present and analyze performances on several recent GPU architectures, including data on energy optimization.