FPGA Implementation of a Special-Purpose VLIW Structure for Double-Precision Elementary Function

Published:2014-06 Issue:2 Volume:7 Page:1-21
ISSN:1936-7406
Container-title:ACM Transactions on Reconfigurable Technology and Systems
language:en
Short-container-title:ACM Trans. Reconfigurable Technol. Syst.

Author:

Lei Yuanwu¹,Guo Lei¹,Dou Yong¹,Ma Sheng¹,Xu Jinbo¹

Affiliation:

1. National University of Defense Technology, China

Abstract

This article explores the capability and flexibility of field-programmable gate arrays (FPGAs) for implementing IEEE-754 double-precision floating-point elementary functions. To evaluate various elementary functions efficiently on unified hardware, we propose a special-purpose very long instruction word (VLIW) processor called DP_VELP. The processor is equipped with multiple basic units, and its performance is improved through an explicitly parallel technique. We propose pipelined evaluation of the polynomial approximation with Estrin's scheme, scheduling the basic components in an optimal order to avoid data-hazard stalls and achieve minimal latency. The custom VLIW processor is highly scalable. Under the control of specific VLIW instructions, the basic units are combined into special-purpose hardware for elementary functions. Common elementary functions are presented as examples to illustrate in detail how an elementary function is designed in DP_VELP. A minimax approximation scheme is used to reduce the degree of the polynomial. The trade-off between lookup-table size and latency is discussed, and the internal precision is carefully planned to guarantee the accuracy of the result. Finally, we build a prototype of the DP_VELP unit, and an FPGA accelerator based on it, on a Xilinx XC6VLX760 FPGA chip to implement the SGP4/SDP4 application. Compared with previous work, the proposed design achieves low latency with a reasonable amount of resources and evaluates a variety of elementary functions on unified hardware to meet the demands of scientific applications. Experimental results show that the proposed design guarantees correct rounding in more than 99% of cases. Moreover, the SGP4/SDP4 accelerator, which is equipped with 39 DP_VELP units and runs at 200 MHz, outperforms the parallel software approach with hyper-threading technology on an Intel Xeon Quad E5620 CPU at 2.40 GHz by a factor of 7.
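
For readers unfamiliar with Estrin's scheme, which the abstract names, the following is a minimal software sketch of the idea. It is not the DP_VELP hardware schedule, and the function names `estrin` and `horner` are hypothetical, chosen only for this illustration. The sketch assumes coefficients are given lowest degree first. Each pass combines neighbouring coefficients as a + b·x^(2^k); the pairs within a pass do not depend on each other, which is what makes the scheme attractive for parallel or pipelined multiply-add units.

```python
from fractions import Fraction


def horner(coeffs, x):
    """Sequential reference: coeffs[i] multiplies x**i."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def estrin(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) by pairwise combination (Estrin's scheme)."""
    if not coeffs:
        return 0
    level = list(coeffs)
    power = x  # holds x, then x**2, x**4, ... on successive passes
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad so every term has a partner
        # Every pair below is independent of the others in this pass,
        # so in hardware the multiply-adds could be issued together.
        level = [level[i] + level[i + 1] * power
                 for i in range(0, len(level), 2)]
        power = power * power
    return level[0]


if __name__ == "__main__":
    coeffs = [1, 2, 3, 4, 5]  # 1 + 2x + 3x^2 + 4x^3 + 5x^4
    x = Fraction(1, 3)        # exact arithmetic, so both methods must agree
    print(estrin(coeffs, x) == horner(coeffs, x))
```

A degree-n polynomial needs roughly log2(n + 1) dependent passes here, compared with n dependent multiply-add steps for Horner's rule, at the cost of extra squarings of x.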

Funder

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/2617594
