Fast matrix multiplication via compiler‐only layered data reorganization and intrinsic lowering

Authors:

Braedy Kuzma¹, Ivan Korostelev¹, João P. L. de Carvalho¹, José E. Moreira², Christopher Barton³, Guido Araujo⁴, José Nelson Amaral¹

Affiliation:

1. Computing Science Department, University of Alberta, Edmonton, Alberta, Canada

2. Thomas J. Watson Research Center, IBM Corporation, New York, New York, USA

3. IBM Canada Software Laboratory, IBM Corporation, Markham, Ontario, Canada

4. Institute of Computing, UNICAMP, Campinas, São Paulo, Brazil

Abstract

The resurgence of machine learning has increased the demand for high‐performance basic linear algebra subroutines (BLAS), which have long depended on libraries to achieve peak performance on commodity hardware. High‐performance BLAS implementations rely on a layered approach that consists of tiling and packing layers, for data (re)organization, and micro kernels that perform the actual computations. The algorithm for the tiling and packing layers is target independent but is parameterized to the memory hierarchy and register‐file size. The creation of high‐performance micro kernels requires significant development effort to write tailored assembly code for each architecture. This hand‐optimization task is complicated by the recent introduction of matrix engines by IBM (Matrix Multiply Assist, MMA), Intel (Advanced Matrix eXtensions, AMX), and Arm (Matrix Extensions, ME) to deliver high‐performance matrix operations. This article presents a compiler‐only alternative to the use of high‐performance libraries by incorporating, to the best of our knowledge for the first time, the automatic generation of the layered approach into LLVM, a production compiler. The modular design of the algorithm, such as the use of LLVM's matrix‐multiply intrinsic as a clear interface between the tiling and packing layers and the micro kernel, makes it easy to retarget the code generation to multiple accelerators. The parameterization of the tiling and packing layers is demonstrated in the generation of code for the MMA unit on IBM's POWER10. This article also describes an algorithm that lowers the matrix‐multiply intrinsic to the MMA unit. The use of intrinsics enables a comprehensive performance study. In processors without hardware matrix engines, the tiling and packing layers deliver speedups over PLuTo, a widely used polyhedral optimizer, both for small matrices (on Intel) and for large matrices (on POWER9).
The performance also approaches that of high‐performance libraries: it is only slightly slower than OpenBLAS and on par with Eigen for large matrices. With MMA in POWER10 this solution is, for large matrices, faster than the vector‐extension solution, matches Eigen's performance, and reaches a substantial fraction of BLAS peak performance.

Publisher

Wiley

Subject

Software
