Simultaneous branch and warp interweaving for sustained GPU performance

Published: 2012-09-05 Issue: 3 Volume: 40 Pages: 49-60
ISSN: 0163-5964
Container title: ACM SIGARCH Computer Architecture News
Language: en
Short container title: SIGARCH Comput. Archit. News

Author:

Brunie Nicolas¹, Collange Caroline², Diamos Gregory³

Affiliation:

1. Kalray and ENS de Lyon

2. Universidade Federal de Minas Gerais

3. NVIDIA Research

Abstract

Single-Instruction Multiple-Thread (SIMT) micro-architectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize the cost of instruction fetch, decode and control logic over multiple execution units. As individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of SIMD execution. We present two complementary techniques that mitigate the impact of thread divergence on SIMT micro-architectures. Both techniques relax the SIMD execution model by allowing two distinct instructions to be scheduled to disjoint subsets of the same row of execution units, instead of one single instruction. They increase flexibility by providing more thread grouping opportunities than SIMD, while preserving the affinity between threads to avoid introducing extra memory divergence. We consider (1) co-issuing instructions from different divergent paths of the same warp and (2) co-issuing instructions from different warps. To support (1), we introduce a novel thread reconvergence technique that ensures threads are run back in lockstep at control-flow reconvergence points without hindering their ability to run branches in parallel. We propose a lane shuffling technique to allow solution (2) to benefit from inter-warp correlations in divergence patterns. The combination of all these techniques improves performance by 23% on a set of regular GPGPU applications and by 40% on irregular applications, while maintaining the same instruction-fetch and processing-unit resource requirements as the contemporary Fermi GPU architecture.
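
The core idea in the abstract, issuing two instructions to disjoint subsets of the same row of lanes, can be illustrated with a toy cycle count. The following is a minimal sketch, not the authors' simulator or hardware design: the function names, the one-instruction-per-cycle assumption, and the 32-lane example are all hypothetical, and the model ignores reconvergence logic, data dependencies, and memory behavior.

```python
# Toy model (assumption): each path is a straight run of instructions,
# and one issue slot handles one instruction per cycle.

def serialized_cycles(mask, n_taken, n_not_taken):
    """Baseline SIMT: the two divergent paths issue one after the other."""
    taken = sum(mask)                  # lanes that took the branch
    not_taken = len(mask) - taken      # lanes that did not
    cycles = 0
    if taken:
        cycles += n_taken
    if not_taken:
        cycles += n_not_taken
    return cycles

def co_issued_cycles(mask, n_taken, n_not_taken):
    """Co-issue: both paths advance in the same cycle on disjoint lanes."""
    taken = sum(mask)
    not_taken = len(mask) - taken
    a = n_taken if taken else 0
    b = n_not_taken if not_taken else 0
    return max(a, b)

def lane_utilization(mask, n_taken, n_not_taken, cycles):
    """Fraction of lane-cycles that perform useful work."""
    taken = sum(mask)
    useful = taken * n_taken + (len(mask) - taken) * n_not_taken
    return useful / (cycles * len(mask))

def masks_disjoint(mask_a, mask_b):
    """Two warps can share a row only if no lane is active in both."""
    return not any(a and b for a, b in zip(mask_a, mask_b))

# Hypothetical 32-lane warp: 8 lanes take the branch, 24 do not,
# and each path is 10 instructions long.
mask = [True] * 8 + [False] * 24
base = serialized_cycles(mask, 10, 10)
dual = co_issued_cycles(mask, 10, 10)
print(base, lane_utilization(mask, 10, 10, base))
print(dual, lane_utilization(mask, 10, 10, dual))
```

In this idealized case the serialized schedule takes 20 cycles at 50% lane utilization, while co-issuing takes 10 cycles at 100%. The `masks_disjoint` check stands in for the condition under which instructions from two different warps could share a row; real divergence patterns are rarely this balanced, so the figures reported in the abstract are the ones to rely on.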

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/2366231.2337166

References: 30 articles.

1. Transparent control independence (TCI)

2. Rodinia: A benchmark suite for heterogeneous computing

3. Barra: A Parallel Functional Simulator for GPGPU

4. Control Flow Optimization Via Dynamic Reconvergence Prediction

Cited by 6 articles.

1. Scalable dual-instruction multiple-data processing on an efficient systolic-array architecture;2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW);2024-05-27

2. Optimization Methods for Computing System in Mobile CPS;Proceedings of the 2nd International Conference on Big Data Technologies - ICBDT2019;2019

3. An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor;Journal of Systems Architecture;2016-08

4. A Closer Look at GPGPU;ACM Computing Surveys;2016-05-02

5. Efficient warp execution in presence of divergence with collaborative context collection;Proceedings of the 48th International Symposium on Microarchitecture;2015-12-05