VLIW DSP-Based Low-Level Instruction Scheme of Givens QR Decomposition for Real-Time Processing

Published:2017-04-24 Issue:09 Volume:26 Page:1750129
ISSN:0218-1266
Container-title:Journal of Circuits, Systems and Computers
language:en
Short-container-title:J CIRCUIT SYST COMP

Author:

Najoui Mohamed¹,Bahtat Mounir¹,Hatim Anas²,Belkouch Said¹,Chabini Noureddine³

Affiliation:

1. LGECOS Lab, ENSA-Marrakech, University of Cadi Ayyad, Marrakech, Morocco

2. ENSA-Agadir, Ibn Zohr University, Agadir, Morocco

3. Department of Electrical and Computer Engineering, Royal Military College of Canada, Kingston, ON K7K 7B4, Canada

Abstract

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in signal processing applications, and its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned to take advantage of the architectural features of these new processors. However, on some processor architectures, such as very long instruction word (VLIW), compiler efficiency alone is not enough to make effective use of the available computational resources. This paper presents an efficient and optimized approach to implementing Givens QRD on a low-power platform based on a VLIW architecture. To overcome the compiler's limits in parallelizing most of the Givens arithmetic operations, we propose a low-level instruction scheme that can maximize the parallelism rate and minimize clock cycles. The key contributions of this work are as follows: (i) a new parallel and fast design of the Givens algorithm based on VLIW features (i.e., instruction-level parallelism (ILP) and data-level parallelism (DLP)), including the cache memory properties; (ii) an efficient data management approach that avoids cache misses and memory bank conflicts. Two DSP platforms, C6678 and AK2H12, were used as implementation targets. The introduced parallel QR implementation method achieves, on average, speedups of more than 12× and 6× over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme is at least 3.65 and 2.5 times faster than recent CPU and DSP implementations, respectively.

Publisher

World Scientific Pub Co Pte Ltd

Subject

Electrical and Electronic Engineering, Hardware and Architecture

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0218126617501298

References: 13 articles.

1. Trace Scheduling: A Technique for Global Microcode Compaction

2. Parallel tiled QR factorization for multicore architectures

3. Through-Wall Image Enhancement Using Fuzzy and QR Decomposition

4. High-Resolution Radar via Compressed Sensing

Cited by 3 articles.

1. Ultra-fast and efficient implementation schemes of complex matrix multiplication algorithm for VLIW architectures;Computers and Electrical Engineering;2022-09

2. An efficient and scalable parallel mapping of pulse-Doppler radar signal processing chain on a multi-core DSP;Microprocessors and Microsystems;2021-09

3. Novel Implementation Approach with Enhanced Memory Access Performance of MGS Algorithm for VLIW Architecture;Journal of Circuits, Systems and Computers;2020-02-19