Improving performance of SYCL applications on CPU architectures using LLVM‐directed compilation flow

Author:

Ghiglio Pietro (1), Dolinsky Uwe (1), Goli Mehdi (1), Narasimhan Kumudha (1)

Affiliation:

1. Codeplay Software Ltd., Edinburgh, UK

Abstract

Summary: The wide adoption of SYCL as an open‐standard API for accelerating C++ software in domains such as HPC, automotive, artificial intelligence, and machine learning necessitates efficient compiler and runtime support for a growing number of platforms. Existing SYCL implementations support devices such as CPUs, GPUs, DSPs, and FPGAs, typically via OpenCL or CUDA backends. While accelerators have increased the performance of user applications significantly, employing CPU devices for further performance improvement is beneficial because CPUs are widely present in existing data‐centers. SYCL applications on CPUs currently go through an OpenCL backend. Though an OpenCL backend is valuable for supporting accelerators, it may introduce additional overhead on CPUs, where the host and device are the same. Overheads such as run‐time compilation of the kernel, transfer of input/output memory to and from the OpenCL device, and invocation of the OpenCL kernel may not be necessary when running on the CPU. While some of these overheads (such as data transfer) can be avoided by modifying the application, such modifications can compromise the SYCL application's ability to achieve performance portability on other devices. In this article, we propose an alternative approach to running SYCL applications on CPUs. We bypass OpenCL and use a CPU‐directed compilation flow, along with the integration of whole function vectorization, to generate optimized host and device code together in the same translation unit. We compare the performance of our approach, the CPU‐directed compilation flow, with that of an OpenCL backend for existing SYCL‐based applications with no code modification: the BabelStream benchmark, Matmul from the ComputeCpp SDK, N‐body simulation benchmarks, and SYCL‐BLAS (Aliaga et al. Proceedings of the 5th International Workshop on OpenCL; 2017.), on CPUs from different vendors and architectures.
We report a performance improvement of up to on BabelStream benchmarks, up to on Matmul, up to on the N‐body simulation benchmark and up to 16% on SYCL‐BLAS.
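
The data-transfer overhead described in the abstract can be illustrated with a minimal sketch in plain standard C++. This is not the authors' compilation flow and uses no SYCL or OpenCL API; the names `TriadKernel`, `run_with_staging`, and `run_direct` are hypothetical, chosen only for illustration. It assumes a BabelStream-style triad kernel and contrasts an offload-style path, which copies data into and out of separate "device" allocations, with a direct path in which the kernel is an ordinary function in the same translation unit operating on host memory in place.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a kernel body: a triad, a[i] = b[i] + scalar * c[i].
struct TriadKernel {
    double* a;
    const double* b;
    const double* c;
    double scalar;
    void operator()(std::size_t i) const { a[i] = b[i] + scalar * c[i]; }
};

// Offload-style path (illustrative only): inputs are copied into separate
// "device" allocations, the kernel runs there, and the result is copied back.
std::vector<double> run_with_staging(const std::vector<double>& b,
                                     const std::vector<double>& c,
                                     double scalar) {
    std::vector<double> dev_b(b);          // host -> "device" copy
    std::vector<double> dev_c(c);          // host -> "device" copy
    std::vector<double> dev_a(b.size());
    TriadKernel kernel{dev_a.data(), dev_b.data(), dev_c.data(), scalar};
    for (std::size_t i = 0; i < dev_a.size(); ++i) kernel(i);
    return std::vector<double>(dev_a);     // "device" -> host copy
}

// Direct path (illustrative only): host and device are the same CPU, so the
// kernel works on the caller's memory with no staging copies. The loop over
// work-items is a plain contiguous loop that a compiler may vectorize.
void run_direct(std::vector<double>& a,
                const std::vector<double>& b,
                const std::vector<double>& c,
                double scalar) {
    TriadKernel kernel{a.data(), b.data(), c.data(), scalar};
    for (std::size_t i = 0; i < a.size(); ++i) kernel(i);
}
```

Both functions compute the same result; the direct path simply omits the three array copies, which is the kind of saving the abstract attributes to running on a device that shares memory with the host.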

Funder

Innovate UK

Publisher

Wiley

Subject

Computational Theory and Mathematics, Computer Networks and Communications, Computer Science Applications, Theoretical Computer Science, Software

