Benchmarking GPU Tensor Cores on General Matrix Multiplication Kernels through CUTLASS

Published: 2023-12-06 Issue: 24 Volume: 13 Page: 13022
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Huang Xuanteng¹, Zhang Xianwei¹, Yang Panfei¹, Xiao Nong¹

Affiliation:

1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China

Abstract

GPUs have been broadly used to accelerate big data analytics, scientific computing and machine intelligence. In particular, matrix multiplication and convolution are two principal operations that account for a large proportion of the computation in modern data analysis and deep neural networks. These performance-critical operations are often offloaded to the GPU to obtain substantial improvements in end-to-end latency. In addition, the diverse workload characteristics and complicated processing phases in big data demand a customizable yet performant operator library. To this end, GPU vendors, including NVIDIA and AMD, have proposed templated and composable GPU operator libraries to conduct specific computations on certain types of low-precision data elements. We formalize a set of benchmarks via CUTLASS, NVIDIA’s templated library that provides high-performance and hierarchically designed kernels. The benchmarking results show that, with the necessary fine-tuning, hardware-level ASICs such as tensor cores can dramatically boost the performance of specific operations, such as GEMM, that are offloaded to modern GPUs.

Funder

National Natural Science Foundation of China

Major Program of Guangdong Basic and Applied Research

Funding by Science and Technology Projects in Guangzhou

Open Project of China Electronic Product Reliability and Environmental Testing Research Institute

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/13/24/13022/pdf
