Accelerated linear algebra compiler for computationally efficient numerical models: Success and potential area of improvement

Author:

He XuzhenORCID

Abstract

The recent dramatic progress in machine learning is partially attributed to the availability of high-performant computers and development tools. The accelerated linear algebra (XLA) compiler is one such tool that automatically optimises array operations (mostly fusion to reduce memory operations) and compiles the optimised operations into high-performant programs specific to target computing platforms. Like machine-learning models, numerical models are often expressed in array operations, and thus their performance can be boosted by XLA. This study is the first of its kind to examine the efficiency of XLA for numerical models, and the efficiency is examined stringently by comparing its performance with that of optimal implementations. Two shared-memory computing platforms are examined–the CPU platform and the GPU platform. To obtain optimal implementations, the computing speed and its optimisation are rigorously studied by considering different workloads and the corresponding computer performance. Two simple equations are found to faithfully modell the computing speed of numerical models with very few easily-measureable parameters. Regarding operation optimisation within XLA, results show that models expressed in low-level operations (e.g., slice, concatenation, and arithmetic operations) are successfully fused while high-level operations (e.g., convolution and roll) are not. Regarding compilation within XLA, results show that for the CPU platform of certain computers and certain simple numerical models on the GPU platform, XLA achieves high efficiency (> 80%) for large problems and acceptable efficiency (10%~80%) for medium-size problems–the gap is from the overhead cost of Python. Unsatisfactory performance is found for the CPU platform of other computers (operations are compiled in a non-optimal way) and for high-dimensional complex models for the GPU platform, where each GPU thread in XLA handles 4 (single precision) or 2 (double precision) output elements–hoping to exploit the high-performant instructions that can read/write 4 or 2 floating-point numbers with one instruction. However, these instructions are rarely used in the generated code for complex models and performance is negatively affected. Therefore, flags should be added to control the compilation for these non-optimal scenarios.

Funder

Australian Research Council

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary

Reference31 articles.

1. Work–energy analysis of granular assemblies validates and calibrates a constitutive model.;X. He;Granul Matter.,2020

2. Ready-to-use deep-learning surrogate models for problems with spatially variable inputs and outputs;X. He;Acta Geotechnica,2022

3. A new k-ϵ eddy viscosity model for high reynolds number turbulent flows;T.-H. Shih;Computers & Fluids,1995

4. Fractional diffusion models of option prices in markets with jumps;Á. Cartea;Physica A: Statistical Mechanics and Its Applications,2007

5. Computational Complexity: A Modern Approach;S. Arora,2009

Cited by 2 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

1. Learning Constitutive Relations From Soil Moisture Data via Physically Constrained Neural Networks;Water Resources Research;2024-07

2. Pros and Cons of Executable Neural Networks for Deeply Embedded Systems;Proceedings of the 2023 Workshop on Compilers, Deployment, and Tooling for Edge AI;2023-09-21

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3