Benchmarking GPU Tensor Cores on General Matrix Multiplication Kernels through CUTLASS
Published: 2023-12-06
Issue: 24
Volume: 13
Page: 13022
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences
Author:
Huang Xuanteng¹, Zhang Xianwei¹, Yang Panfei¹, Xiao Nong¹
Affiliation:
1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
Abstract
GPUs are widely used to accelerate big data analytics, scientific computing, and machine intelligence. In particular, matrix multiplication and convolution are two principal operations that account for a large proportion of the steps in modern data analysis and deep neural networks. These performance-critical operations are often offloaded to the GPU to obtain substantial improvements in end-to-end latency. In addition, the diverse workload characteristics and complicated processing phases of big data demand an operator library that is both customizable and performant. To this end, GPU vendors, including NVIDIA and AMD, have proposed templated and composable GPU operator libraries that perform specific computations on certain types of low-precision data elements. We formalize a set of benchmarks via CUTLASS, NVIDIA’s templated library that provides high-performance, hierarchically designed kernels. The benchmarking results show that, with the necessary fine-tuning, hardware-level ASICs such as tensor cores can dramatically boost the performance of specific operations, such as GEMM, that are offloaded to modern GPUs.
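As a rough illustration of how GEMM throughput is commonly reported, the sketch below times a kernel and converts the elapsed time into GFLOP/s using the standard count of 2·M·N·K floating-point operations for an (M×K) by (K×N) product. It is not taken from the paper and does not use CUTLASS or a GPU: `naive_gemm` is a hypothetical CPU stand-in for the kernel under test, and `gemm_flops` and `benchmark` are names chosen here for illustration only.

```python
import time


def gemm_flops(m, n, k):
    """FLOPs for C = A @ B with A (m x k) and B (k x n):
    one multiply and one add per inner-product term."""
    return 2 * m * n * k


def naive_gemm(a, b):
    """Triple-loop GEMM on lists of lists (hypothetical stand-in kernel)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        row_c = c[i]
        for p in range(k):
            a_ip = a[i][p]
            row_b = b[p]
            for j in range(n):
                row_c[j] += a_ip * row_b[j]
    return c


def benchmark(kernel, a, b, repeats=3):
    """Run the kernel several times; return (best_seconds, gflops)."""
    m, k, n = len(a), len(b), len(b[0])
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(a, b)
        best = min(best, time.perf_counter() - start)
    return best, gemm_flops(m, n, k) / best / 1e9


if __name__ == "__main__":
    size = 64
    a = [[1.0] * size for _ in range(size)]
    b = [[2.0] * size for _ in range(size)]
    seconds, gflops = benchmark(naive_gemm, a, b)
    print(f"{size}x{size}x{size} GEMM: {seconds:.4f} s, {gflops:.3f} GFLOP/s")
```

Taking the best of several runs is one simple way to reduce timing noise; a GPU benchmark would additionally need warm-up iterations and device synchronization before reading the clock.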
Funder
National Natural Science Foundation of China; Major Program of Guangdong Basic and Applied Research; Funding by Science and Technology Projects in Guangzhou; Open Project of China Electronic Product Reliability and Environmental Testing Research Institute
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science