MiC

Author:

Liu Qixiao1, Chen Zhifeng2, Yu Zhibin3

Affiliation:

1. Huawei Technology Co. Ltd. and Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

2. China Electronic Standardization Institute, Beijing, China

3. Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Abstract

Graphics processing units (GPUs) have enjoyed increasing popularity in recent years, driven by, for example, general-purpose GPU (GPGPU) computing for parallel programs and by new computing paradigms such as the Internet of Things (IoT). GPUs hold great potential for big data analytics as the demand for processing large quantities of data in real time keeps growing. However, the pervasive presence of GPUs on mobile devices presents great challenges for GPGPU, mainly because a GPGPU integrates a large number of processor arrays and concurrently executing threads (up to hundreds of thousands). In particular, current approaches cannot reveal in detail the root causes of performance loss in a GPGPU program. In this article, we propose MiC (Multi-level Characterization), a framework that comprehensively characterizes GPGPU kernels at the instruction, basic block (BBL), and thread levels. Specifically, we devise Instruction Vectors (IV) and Basic Block Vectors (BBV), a Thread Similarity Matrix (TSM), and a Divergence Flow Statistics Graph (DFSG) to profile information at each level. We use MiC to provide insights into GPGPU kernels by characterizing 34 kernels from popular GPGPU benchmark suites such as the Compute Unified Device Architecture (CUDA) Software Development Kit (SDK), Rodinia, and Parboil.
In comparison with Central Processing Unit (CPU) workloads, our key findings are as follows: (1) the Instruction-Level Parallelism (ILP) is comparable; (2) the BBL count is significantly smaller than in CPU workloads, only 22.8 on average; (3) the dynamic instruction count per thread varies from dozens to tens of thousands, which is extremely small compared with CPU benchmarks; (4) the Pareto principle (also called the 90/10 rule) does not apply to GPGPU kernels, although it is pervasive in CPU programs; (5) the loop patterns are dramatically different from those in CPU workloads; (6) the branch ratio is lower than that of CPU programs but higher than that of pure GPU workloads. We also show how TSM and DFSG can be used to characterize branch divergence visually, enabling the analysis of thread behavior in GPGPU programs. Finally, we present an optimization case for a GPGPU kernel, based on the bottleneck identified through its characterization, which improves performance by 16.8%.
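
Finding (4) can be made concrete with a small check. The following is a minimal sketch and not MiC's implementation: it assumes a hypothetical profile that maps each basic block to its static instruction count and its execution count, and the name `hot_code_fraction` is illustrative only. It reports what fraction of the static code is needed to cover 90% of the dynamically executed instructions. A value near 0.1 matches the 90/10 rule, while a value near 1.0 means execution is spread across most of the code.

```python
from typing import Dict, Tuple

# Hypothetical profile format (an assumption, not taken from the paper):
#   basic-block label -> (static instruction count, execution count)
Profile = Dict[str, Tuple[int, int]]


def hot_code_fraction(profile: Profile, coverage: float = 0.9) -> float:
    """Fraction of static instructions needed to cover `coverage`
    of all dynamically executed instructions."""
    total_static = sum(size for size, _ in profile.values())
    total_dynamic = sum(size * count for size, count in profile.values())
    if total_static == 0 or total_dynamic == 0:
        raise ValueError("profile must contain executed instructions")

    # Visit the hottest blocks first, ranked by dynamic instructions contributed.
    blocks = sorted(profile.values(), key=lambda b: b[0] * b[1], reverse=True)

    covered_dynamic = 0
    used_static = 0
    for size, count in blocks:
        if covered_dynamic >= coverage * total_dynamic:
            break
        covered_dynamic += size * count
        used_static += size
    return used_static / total_static


# Made-up profiles for illustration only.
cpu_like = {"loop": (10, 10000), "init": (45, 1), "exit": (45, 1)}
flat = {"b0": (5, 8), "b1": (5, 8), "b2": (5, 8), "b3": (5, 8)}

print(hot_code_fraction(cpu_like))  # 0.1: one hot loop dominates
print(hot_code_fraction(flat))      # 1.0: every block is needed
```

Both example profiles are invented. A real profile would come from an instrumentation or tracing tool, and the 0.9 threshold is simply the conventional reading of the 90/10 rule.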

Funder

National Natural Science Foundation of China

Postdoctoral Science Foundation of China

National Key Research and Development Program of China

NSF of Guangdong Province

Publisher

Association for Computing Machinery (ACM)

Subject

Electrical and Electronic Engineering, Hardware and Architecture, Software
