LUT‐DSP usage trade‐off for re‐configurable convolution acceleration core based on small logarithmic floating point representation

Published: 2023-10-24
ISSN: 0098-9886
Container-title: International Journal of Circuit Theory and Applications
Language: en
Short-container-title: Circuit Theory & Apps

Author:

Xiong Botao¹, Fan Sheng¹, He Xintong¹, Zhou Zezhao¹, Yang Runhua¹, Li Sicun¹, Shen Rensheng¹, Chang Yuchun¹

Affiliation:

1. School of Microelectronics, Dalian University of Technology, Dalian, China

Abstract

The challenge in designing a high‐performance field‐programmable gate array (FPGA)‐based convolution accelerator is to take full advantage of the on‐chip computing resources. The reported CNN accelerators always exhaust the on‐chip DSPs and leave other computing resources under‐utilized. Hence, this brief presents a novel convolution acceleration core based on the small logarithmic floating‐point (SLFP) format, which makes three contributions. (1) The SLFP<3,5> multiplier is implemented with LUT6s only and, with the aid of the carry chain, operates at 650 MHz while providing sufficient accuracy for most CNNs. In addition, a similar structure can be used to implement an SLFP<3,5> divider. (2) The DSPs in the TWO24 SIMD mode are cascaded to implement a 9‐input adder tree. The sum of the multiples of elements (e.g., […], […]) is easily obtained by configuring the last DSP in the 9‐input adder tree in the accumulation mode, which can support more kernels (e.g., […], […]) with a high utilization rate ([…]). (3) The convolution core based on the SLFP format uses only LUT6s and DSPs to achieve 1300 MOPS, 433 MOPS, and 81 MOPS for […], […], and […] kernels, respectively. In summary, the proposed convolution accelerator not only balances the resource usage of LUT6s and DSPs but also quantizes most CNN models using several simple scaling operations instead of a computing‐intensive retraining algorithm, because the distribution of SLFP numbers is very similar to that of FP32 numbers.

Funder

Fundamental Research Funds for the Central Universities

National Natural Science Foundation of China

Publisher

Wiley

Subject

Applied Mathematics,Electrical and Electronic Engineering,Computer Science Applications,Electronic, Optical and Magnetic Materials

Link

https://onlinelibrary.wiley.com/doi/pdf/10.1002/cta.3834

References (16 articles)

1. [DL] A Survey of FPGA-based Neural Network Inference Accelerators

2. DSP-Efficient Hardware Acceleration of Convolutional Neural Network Inference on FPGAs

3. Gholami A, Kim S, Dong Z, Yao Z, Mahoney MW, Keutzer K. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630; 2021.

4. Xilinx. Deep learning with INT8 optimization on Xilinx devices (WP486); 2016.

5. Xilinx. Convolutional neural network with INT4 optimization on Xilinx devices (WP521); 2020.

Cited by 1 article.

1. Low noise, temperature‐compensated, electrochemical cell sigma–delta current measurement readout circuit; International Journal of Circuit Theory and Applications; 2024-08