Affiliation
1. School of Software, Shandong University, Jinan, Shandong, P. R. China
2. School of Software, East China Normal University, Shanghai, P. R. China
Abstract
The new-generation Sunway supercomputer offers ultra-high computing capacity. However, because of its unique heterogeneous architecture, open-source implementations of the Basic Linear Algebra Subprograms (BLAS) fall short in performance or compatibility. In addition, because the architecture has been updated, BLAS implementations built for the previous Sunway cannot fully exploit the performance of its successor. To address these challenges, we propose an optimized BLAS for the new-generation Sunway supercomputer. Specifically, to achieve efficient computation, we first propose a parallel optimization method, based on the new-generation Sunway, for Level-1 BLAS (vector-vector operations) and Level-2 BLAS (matrix-vector operations). We then propose an adaptive scheduling algorithm for various data sizes, which balances the tasks across core groups. Finally, to achieve highly efficient general matrix multiplication (GEMM) kernels, we propose a parallel optimization method, based on the new-generation Sunway, for Level-3 BLAS (matrix-matrix operations), which includes both source-level and assembly-level optimization. Experimental results show that the memory bandwidth utilization of the optimized Level-1/2 BLAS exceeds 95%, and the computational efficiency of the optimized GEMM kernel exceeds 94%.
Funder
Key Technologies Research and Development Program
Publisher
World Scientific Pub Co Pte Ltd
Subject
Electrical and Electronic Engineering, Hardware and Architecture