Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to Target FPGAs, GPUs, and CPUs

Author:

Macintosh, Hamish J. (1, 2); Banks, Jasmine E. (1); Kelson, Neil A. (2)

Affiliation:

1. School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Queensland 4001, Australia

2. eResearch Office, Division of Research and Innovation, Queensland University of Technology, Brisbane, Queensland 4001, Australia

Abstract

Solving diagonally dominant tridiagonal linear systems is a common problem in scientific high-performance computing (HPC). Furthermore, it is becoming more commonplace for HPC platforms to utilise a heterogeneous combination of computing devices. Whilst it is desirable to design faster implementations of parallel linear system solvers, power consumption concerns are increasing in priority. This work presents the oclspkt routine. The oclspkt routine is a heterogeneous OpenCL implementation of the truncated SPIKE algorithm that can use FPGAs, GPUs, and CPUs to concurrently accelerate the solving of diagonally dominant tridiagonal linear systems. The routine is designed to solve tridiagonal systems of any size and can dynamically allocate optimised workloads to each accelerator in a heterogeneous environment depending on the accelerator’s compute performance. The truncated SPIKE FPGA solver is developed first for optimising OpenCL device kernel performance, global memory bandwidth, and interleaved host to device memory transactions. The FPGA OpenCL kernel code is then refactored and optimised to best exploit the underlying architecture of the CPU and GPU. An optimised TDMA OpenCL kernel is also developed to act as a serial baseline performance comparison for the parallel truncated SPIKE kernel since no FPGA tridiagonal solver capable of solving large tridiagonal systems was available at the time of development. The individual GPU, CPU, and FPGA solvers of the oclspkt routine are 110%, 150%, and 170% faster, respectively, than comparable device-optimised third-party solvers and applicable baselines. Assessing heterogeneous combinations of compute devices, the GPU + FPGA combination is found to have the best compute performance and the FPGA-only configuration is found to have the best overall estimated energy efficiency.
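The abstract names TDMA (the tridiagonal matrix algorithm, also called the Thomas algorithm) as the serial baseline for the parallel truncated SPIKE kernel. As an illustration of what that baseline computes, here is a minimal pure-Python sketch. It is not the authors' OpenCL kernel: the function name `tdma_solve` and the array layout are assumptions made for this example, and, like the systems described in the abstract, the input is assumed to be diagonally dominant so that no pivoting is needed.

```python
def tdma_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm (TDMA).

    Hypothetical layout, chosen for this sketch only:
      lower[i] : sub-diagonal entry of row i   (lower[0] is unused)
      diag[i]  : main-diagonal entry of row i
      upper[i] : super-diagonal entry of row i (upper[n-1] is unused)
      rhs[i]   : right-hand side of row i
    No pivoting is performed, so diagonal dominance is assumed.
    """
    n = len(diag)
    c_prime = [0.0] * n  # modified super-diagonal
    d_prime = [0.0] * n  # modified right-hand side

    # Forward sweep: eliminate the sub-diagonal row by row.
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution: recover the unknowns from the last row upwards.
    x = [0.0] * n
    x[n - 1] = d_prime[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # 4x4 diagonally dominant system whose exact solution is [1, 2, 3, 4].
    lower = [0.0, 1.0, 1.0, 1.0]
    diag = [4.0, 4.0, 4.0, 4.0]
    upper = [1.0, 1.0, 1.0, 0.0]
    rhs = [6.0, 12.0, 18.0, 19.0]
    print(tdma_solve(lower, diag, upper, rhs))
```

Each step of both loops depends on the result of the previous one, which is why TDMA serves as a serial reference point; the truncated SPIKE approach described in the abstract instead partitions the system so that the pieces can be solved concurrently.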

Publisher

Hindawi Limited

Subject

Hardware and Architecture

Cited by 7 articles.

1. High throughput multidimensional tridiagonal system solvers on FPGAs;Proceedings of the 36th ACM International Conference on Supercomputing;2022-06-28

2. FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL;International Workshop on OpenCL;2022-05-10

3. Efficient Hardware Implementation of Error Correcting Codes Classification Algorithm;2021 2nd International Conference on Electronics, Communications and Information Technology (CECIT);2021-12

4. An Overview of Cyber-Physical Systems’ Hardware Architecture Concerning Machine Learning;2021 IEEE/AIAA 40th Digital Avionics Systems Conference (DASC);2021-10-03

5. Vector Operations for Accelerating Expensive Bayesian Computations – A Tutorial Guide;Bayesian Analysis;2021-01-01
