Affiliation:
1. Huawei Technology Co. Ltd. and Shenzhen Institute of Advanced Technology, Chinese Academy of Science, Shenzhen, China
2. China Electronic Standardization Institute, Beijing, China
3. Shenzhen Institute of Advanced Technology, Chinese Academy of Science, Shenzhen, China
Abstract
Graphics processing units (GPUs) have enjoyed increasing popularity in recent years, driven by, for example, general-purpose GPU (GPGPU) computing for parallel programs and new computing paradigms such as the Internet of Things (IoT). GPUs hold great potential for providing effective solutions for big data analytics as the demand for processing large quantities of data in real time increases. However, the pervasive presence of GPUs on mobile devices presents great challenges for GPGPU, mainly because GPGPU integrates a large number of processor arrays and concurrently executing threads (up to hundreds of thousands). In particular, current approaches cannot reveal in detail the root causes of performance loss in a GPGPU program.
In this article, we propose MiC (Multi-level Characterization), a framework that comprehensively characterizes GPGPU kernels at the instruction, Basic Block (BBL), and thread levels. Specifically, we devise Instruction Vectors (IV) and Basic Block Vectors (BBV), a Thread Similarity Matrix (TSM), and a Divergence Flow Statistics Graph (DFSG) to profile information at each level. We use MiC to provide insights into GPGPU kernels by characterizing 34 kernels from popular GPGPU benchmark suites such as the Compute Unified Device Architecture (CUDA) Software Development Kit (SDK), Rodinia, and Parboil. In comparison with Central Processing Unit (CPU) workloads, our key findings are as follows: (1) The Instruction-Level Parallelism (ILP) is comparable; (2) The BBL count is significantly smaller than in CPU workloads, only 22.8 on average; (3) The dynamic instruction count per thread varies from dozens to tens of thousands and is extremely small compared to CPU benchmarks; (4) The Pareto principle (also called the 90/10 rule) does not apply to GPGPU kernels, although it is pervasive in CPU programs; (5) The loop patterns are dramatically different from those in CPU workloads; (6) The branch ratio is lower than that of CPU programs but higher than that of pure GPU workloads. We also show how the TSM and DFSG characterize branch divergence visually, enabling the analysis of thread behavior in GPGPU programs. Finally, we present an optimization case for a GPGPU kernel, guided by the bottleneck identified through its characterization, which improves performance by 16.8%.
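To make the thread-level idea concrete, the following is a minimal sketch of how a thread similarity matrix could be derived from per-thread basic block vectors. It is not the MiC implementation: the abstract does not define the TSM, so the similarity measure here (one minus half the Manhattan distance between normalized BBVs), the function names `normalize` and `thread_similarity_matrix`, and the sample counts are all assumptions chosen for illustration.

```python
from typing import List


def normalize(bbv: List[int]) -> List[float]:
    """Scale a basic block vector so its entries sum to 1."""
    total = sum(bbv)
    if total == 0:
        # A thread that executed nothing maps to the all-zero vector.
        return [0.0] * len(bbv)
    return [count / total for count in bbv]


def thread_similarity_matrix(bbvs: List[List[int]]) -> List[List[float]]:
    """Return pairwise similarities; 1.0 means identical normalized BBVs.

    Assumed measure (not taken from the paper): 1 - d/2, where d is the
    Manhattan distance between two normalized BBVs, so d lies in [0, 2].
    """
    norm = [normalize(v) for v in bbvs]
    n = len(norm)
    tsm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist = sum(abs(a - b) for a, b in zip(norm[i], norm[j]))
            tsm[i][j] = 1.0 - dist / 2.0
    return tsm


# Hypothetical execution counts of three basic blocks for three threads.
# Threads 0 and 1 follow the same path; thread 2 diverges.
bbvs = [
    [10, 5, 0],
    [10, 5, 0],
    [0, 5, 10],
]
tsm = thread_similarity_matrix(bbvs)

for row in tsm:
    print(" ".join(f"{value:.2f}" for value in row))
```

Under this assumed measure, threads that take the same control-flow path score 1.0 against each other, and lower off-diagonal values flag thread pairs that diverge, which is the kind of pattern a visual TSM is meant to expose.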
Funder
National Natural Science Foundation of China
Postdoctoral Science Foundation of China
National Key Research and Development Program of China
NSF of Guangdong Province
Publisher
Association for Computing Machinery (ACM)
Subject
Electrical and Electronic Engineering, Hardware and Architecture, Software
References: 47 articles.
Cited by
1 article.
1. POTDP: Research GPU Performance Optimization Method based on Thread Dynamic Programming; 2022 IEEE 4th International Conference on Power, Intelligent Computing and Systems (ICPICS); 2022-07-29