Area and performance tradeoffs in floating-point divide and square-root implementations

Published: 1996-09 Issue: 3 Volume: 28 Page: 518-564
ISSN: 0360-0300
Container-title: ACM Computing Surveys
Language: en
Short-container-title: ACM Comput. Surv.

Author:

Soderquist Peter¹, Leeser Miriam²

Affiliation:

1. Cornell Univ., Ithaca, NY

2. Northeastern Univ., Boston, MA

Abstract

Floating-point divide and square-root operations are essential to many scientific and engineering applications, and are required in all computer systems that support the IEEE floating-point standard. Yet many current microprocessors provide only weak support for these operations. The latency and throughput of division are typically far inferior to those of floating-point addition and multiplication, and square-root performance is often even lower. This article argues the case for high-performance division and square root. It also explains the algorithms and implementations of the primary techniques, subtractive and multiplicative methods, employed in microprocessor floating-point units with their associated area/performance tradeoffs. Case studies of representative floating-point unit configurations are presented, supported by simulation results using a carefully selected benchmark, Givens rotation, to show the dynamic performance impact of the various implementation alternatives. The topology of the implementation is found to be an important performance factor. Multiplicative algorithms, such as the Newton-Raphson method and Goldschmidt's algorithm, can achieve low latencies. However, these implementations serialize multiply, divide, and square root operations through a single pipeline, which can lead to low throughput. While this hardware sharing yields low size requirements for baseline implementations, lower-latency versions require many times more area. For these reasons, multiplicative implementations are best suited to cases where subtractive methods are precluded by area constraints, and modest performance on divide and square root operations is tolerable. Subtractive algorithms, exemplified by radix-4 SRT and radix-16 SRT, can be made to execute in parallel with other floating-point operations.
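The two algorithm families named in the abstract can be illustrated with a minimal software sketch. It is not taken from the article: the function names, the iteration counts, the linear Newton-Raphson seed, and the choice of radix 2 for the subtractive example (the abstract discusses radix-4 and radix-16 SRT) are all assumptions made for illustration. Python floats stand in for hardware multipliers and adders, and operands are assumed to be pre-scaled to [0.5, 1), as a normalized significand would be.

```python
def newton_raphson_reciprocal(d, iterations=5):
    """Multiplicative: approximate 1/d for d in [0.5, 1)."""
    # Assumed seed: the textbook linear estimate 48/17 - 32/17 * d.
    x = 48.0 / 17.0 - 32.0 / 17.0 * d
    for _ in range(iterations):
        # The error 1 - d*x is squared on every pass (quadratic convergence).
        # The two multiplies depend on each other, so they run back to back.
        x = x * (2.0 - d * x)
    return x


def goldschmidt_divide(n, d, iterations=6):
    """Multiplicative: approximate n/d for d in [0.5, 1)."""
    # No seed table is used here (an assumption; hardware usually has one).
    for _ in range(iterations):
        f = 2.0 - d   # correction factor that drives d toward 1
        n *= f        # these two multiplies are independent of each other,
        d *= f        # so a pipelined multiplier can overlap them
    return n          # once d is ~1, n holds the quotient


def srt_radix2_divide(x, d, bits=52):
    """Subtractive: approximate x/d for x, d in [0.5, 1), digits in {-1, 0, 1}."""
    w = x / 2.0       # halve so the partial remainder starts with |w| <= d
    q = 0.0
    weight = 0.5
    for _ in range(bits):
        w *= 2.0      # shift the partial remainder left by one bit
        # Digit selection only compares the shifted remainder with +/- 1/2.
        if w >= 0.5:
            digit = 1
        elif w < -0.5:
            digit = -1
        else:
            digit = 0
        w -= digit * d            # one add/subtract per quotient digit
        q += digit * weight
        weight /= 2.0
    return 2.0 * q    # undo the initial halving


if __name__ == "__main__":
    print(newton_raphson_reciprocal(0.75))
    print(goldschmidt_divide(0.6, 0.75))
    print(srt_radix2_divide(0.6, 0.75))
```

The sketch mirrors the contrast drawn in the abstract: the multiplicative routines converge in a handful of passes but every pass needs the multiplier, while the subtractive routine produces one quotient digit per step using only a shift, a comparison, and an add or subtract, which is why it can sit in its own unit alongside the multiplier.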

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science, Theoretical Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/243439.243481

References: 62 articles.

1. Architecture of the Pentium microprocessor

2. The IBM System/360 Model 91: Floating-point execution unit;ANDERSON S. F.;IBM J. Res.,1967

3. Performance features of the PA7100 microprocessor

4. Higher-radix division using estimates of the divisor and partial remainders;ATKINS D.E.;IEEE Trans. Comput.,1968

Cited by 49 articles.

1. Novel seed generation and quadrature-based square rooting algorithms;Scientific Reports;2022-11-29

2. Optimization of cosmological N-body simulation with FMM-PM on SIMT accelerators;The Journal of Supercomputing;2021-11-05

3. Floating-Point Inverse Square Root Algorithm Based on Taylor-Series Expansion;IEEE Transactions on Circuits and Systems II: Express Briefs;2021-07

4. Ultralow-Latency VLSI Architecture Based on a Linear Approximation Method for Computing Nth Roots of Floating-Point Numbers;IEEE Transactions on Circuits and Systems I: Regular Papers;2021-02

5. A New Multiple-Symbol Differential Detection Strategy for Error-Floor Elimination of IEEE 802.15.4 BPSK Receivers Impaired by Carrier Frequency Offset;Wireless Communications and Mobile Computing;2019-11-26