Accelerating BLAS and LAPACK via Efficient Floating Point Architecture Design

Published:2017-12 Issue:03n04 Volume:27 Page:1750006
ISSN:0129-6264
Container-title:Parallel Processing Letters
language:en
Short-container-title:Parallel Process. Lett.

Author:

Merchant Farhad¹, Chattopadhyay Anupam¹, Raha Soumyendu², Nandy S. K.², Narayan Ranjani³

Affiliation:

1. School of Computer Science and Engineering, Nanyang Technological University, Singapore

2. Department of Computational and Data Science, Indian Institute of Science, Bangalore 560012, India

3. Morphing Machines Pvt. Ltd, India

Abstract

Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) form the basic building blocks of several High Performance Computing (HPC) applications and hence dictate the performance of those applications. Performance in such packages is attained by tuning several algorithmic and architectural parameters, such as the number of parallel operations in the Directed Acyclic Graph of the BLAS/LAPACK routines, the sizes of the memories in the memory hierarchy of the underlying platform, the bandwidth of the memory, and the structure of the compute resources in the underlying platform. In this paper, we closely investigate the impact of the Floating Point Unit (FPU) micro-architecture on performance tuning of BLAS and LAPACK. We present a theoretical analysis of the pipeline depth of different floating point operations such as multiplication, addition, square root, and division, followed by a characterization of BLAS and LAPACK to determine several parameters required by the theoretical framework for deciding the optimum pipeline depth of the floating point operations. A simple design of a Processing Element (PE) is presented, and it is shown that the PE outperforms the most recent custom realizations of BLAS and LAPACK by 1.1x to 1.5x in GFlops/W and 1.9x to 2.1x in GFlops/mm². Compared to multicore, General Purpose Graphics Processing Unit (GPGPU), Field Programmable Gate Array (FPGA), and ClearSpeed CSX700 platforms, a performance improvement of 1.8x to 80x is reported for the PE.

Publisher

World Scientific Pub Co Pte Lt

Subject

Hardware and Architecture, Theoretical Computer Science, Software

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0129626417500062

Reference: 6 articles.

1. LAPACK Users' Guide

2. Deep submicron microprocessor design issues

3. Exploiting fast matrix multiplication within the level 3 BLAS

4. The Movidius Myriad Architecture's Potential for Scientific Computing

5. The specialization trend in computer hardware

Cited by 8 articles.

1. High-Performance Computing Based Operating Systems, Software Dependencies and IoT Integration;Series in BioEngineering;2024

2. Parallel Optimization of BLAS on a New-Generation Sunway Supercomputer;Journal of Circuits, Systems and Computers;2023-05-23

3. Performance of a computing pipeline with data hazards and different stage time delays;Journal of Physics: Conference Series;2021-10-01

4. Models for Calculating Pipeline Performance with Data Hazards;Current Problems and Ways of Industry Development: Equipment and Technologies;2021

5. Critical pipeline of the acyclic wave processor;Journal of Physics: Conference Series;2020-11-01