Authors:
Guillermo Alaejos, Héctor Martínez, Adrián Castelló, Manuel F. Dolz, Francisco D. Igual, Pedro Alonso-Jordá, Enrique S. Quintana-Ortí
Abstract
General matrix multiplication (gemm) is a fundamental kernel in scientific computing and in current frameworks for deep learning. Modern realisations of gemm are mostly written in C, on top of a small, highly tuned micro-kernel that is usually encoded in assembly. High-performance realisations of gemm in linear algebra libraries generally include a single micro-kernel per architecture, usually implemented by an expert. In this paper, we explore two paths to automatically generate gemm micro-kernels, either using C++ templates with vector intrinsics or high-level Python scripts that directly produce assembly code. Both solutions can integrate high-performance software techniques, such as loop unrolling and software pipelining, accommodate any data type, and easily generate micro-kernels of any requested dimension. The performance of this approach is tested on three ARM-based cores and compared with state-of-the-art libraries for these processors: BLIS, OpenBLAS and ArmPL. The experimental results show that the auto-generation approach is highly competitive, mainly due to the possibility of adapting the micro-kernel to the problem dimensions.
Funders
European Commission
European Union
Junta de Andalucía
Agencia Estatal de Investigación
Generalitat Valenciana
Universitat Politècnica de València
Publisher
Springer Science and Business Media LLC
Cited by: 1 article.