Affiliation:
1. Research Fellow, Software Systems Group, School of Computer Science and Statistics, Trinity College Dublin, Ireland
2. Professor, Software Systems Group, School of Computer Science and Statistics, Trinity College Dublin, Ireland
3. Research and Development Engineer, Synopsys Inc., Ireland
4. Heterogeneous Digital Systems Research Engineer, AMD, Germany
5. Fellow, AMD, Ireland
Abstract
FPGA-based accelerators are becoming increasingly popular for deep neural network (DNN) inference because their performance scales with the degree of specialization, whether through dataflow architectures or custom data-type precision. To lower the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based design entries with high-level synthesis (HLS) have been introduced. They provide a higher level of abstraction than register-transfer level (RTL)-based design. HLS offers faster development, better maintainability and more flexibility in code exploration when evaluating several options for multi-dimensional tensors, convolutional layers or different degrees of parallelism. For these reasons, HLS has been adopted by DNN accelerator generation frameworks such as FINN and hls4ml.
In this paper, we present an alternative backend library for FINN, leveraging RTL. We investigate and evaluate, across a spectrum of design dimensions, the pros and cons of an RTL-based implementation versus the original HLS variant. We show that for smaller design parameters, RTL produces significantly smaller circuits than HLS. For larger circuits, however, the look-up table (LUT) count of the RTL-based design is slightly higher, by up to around \(15\% \). On the other hand, HLS consistently requires more flip-flops (FFs) (with an orders-of-magnitude difference for smaller designs) and block RAMs (BRAMs) (2× more). This also impacts the critical path delay, with RTL producing significantly faster circuits, by up to around \(80\% \). Furthermore, RTL benefits from at least a 10× reduction in synthesis time. Finally, the results were validated in practice on two real-world use cases: a multi-layer perceptron (MLP) used in network intrusion detection, and a convolutional network, ResNet, used in image recognition. Overall, since HLS frameworks code-generate the hardware design, the ease of design entry matters less; the savings in synthesis time, together with some design-dependent resource benefits, make the RTL abstraction an attractive alternative.
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Software
References: 44 articles.
Cited by: 4 articles.