On the RTL Implementation of FINN Matrix Vector Unit

Authors:

Syed Asad Alam (1), David Gregg (2), Giulio Gambardella (3), Thomas Preusser (4), Michaela Blott (5)

Affiliations:

1. Research Fellow, Software Systems Group, School of Computer Science and Statistics, Trinity College Dublin, Ireland

2. Professor, Software Systems Group, School of Computer Science and Statistics, Trinity College Dublin, Ireland

3. Research and Development Engineer, Synopsys Inc., Ireland

4. Heterogeneous Digital Systems Research Engineer, AMD, Germany

5. Fellow, AMD, Ireland

Abstract

FPGA-based accelerators are becoming increasingly popular for deep neural network inference due to their ability to scale performance with an increasing degree of specialization, such as dataflow architectures or custom data-type precision. To lower the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based design entries with high-level synthesis (HLS) have been introduced. They provide a higher level of abstraction than register-transfer level (RTL) design. HLS offers faster development time, better maintainability, and more flexibility in code exploration when evaluating several options for multi-dimensional tensors, convolutional layers, or different degrees of parallelism. For this reason, HLS has been adopted by DNN accelerator generation frameworks such as FINN and hls4ml. In this paper, we present an alternative backend library for FINN, leveraging RTL. We investigate and evaluate, across a spectrum of design dimensions, the pros and cons of an RTL-based implementation versus the original HLS variant. We show that for smaller design parameters, RTL produces significantly smaller circuits than HLS. For larger circuits, however, the look-up table (LUT) count of the RTL-based design is slightly higher, by up to around \(15\%\). On the other hand, HLS consistently requires more flip-flops (FFs) (with an orders-of-magnitude difference for smaller designs) and block RAMs (BRAMs) (\(2\times\) more). This also impacts the critical path delay, with RTL producing significantly faster circuits, by up to around \(80\%\). Furthermore, RTL benefits from at least a \(10\times\) reduction in synthesis time. Finally, the results were validated in practice using two real-world use cases: a multi-layer perceptron (MLP) used in network intrusion detection and a convolutional network (ResNet) used in image recognition. Overall, since HLS frameworks code-generate the hardware design, the ease of design entry matters less. As such, the gains in synthesis time, together with design-dependent resource savings, make the RTL abstraction an attractive alternative.
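To make the computation at the heart of this comparison concrete, the sketch below shows, in plain C++, the kind of folded, quantized matrix-vector product that a FINN-style matrix vector unit (MVU) performs. The PE/SIMD folding, the function name MatrixVectorUnit, and the toy data types are illustrative assumptions only; they do not reproduce the actual FINN HLS or RTL sources.

    // Minimal, self-contained C++ sketch of the quantized matrix-vector
    // product that a FINN-style MVU computes. This is NOT the FINN HLS or
    // RTL code; the names (PE, SIMD, MatrixVectorUnit) and the folding
    // scheme are illustrative assumptions.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Degrees of parallelism that the generated hardware would unroll in space:
    // PE   = number of output rows processed in parallel (one accumulator each)
    // SIMD = number of input elements consumed per row per step
    constexpr int PE = 2;
    constexpr int SIMD = 4;

    // One "fold" combines PE rows x SIMD columns per step, mirroring how an
    // MVU time-multiplexes a large weight matrix over a small PE array.
    std::vector<int32_t> MatrixVectorUnit(const std::vector<int8_t>& weights, // rows*cols, row-major
                                          const std::vector<int8_t>& input,   // cols
                                          int rows, int cols) {
        std::vector<int32_t> out(rows, 0);
        for (int r0 = 0; r0 < rows; r0 += PE) {            // fold over output rows
            for (int c0 = 0; c0 < cols; c0 += SIMD) {      // fold over input columns
                for (int pe = 0; pe < PE && r0 + pe < rows; ++pe) {
                    int32_t acc = 0;                        // partial dot product
                    for (int s = 0; s < SIMD && c0 + s < cols; ++s) {
                        acc += int32_t(weights[(r0 + pe) * cols + (c0 + s)]) *
                               int32_t(input[c0 + s]);
                    }
                    out[r0 + pe] += acc;                    // accumulate across folds
                }
            }
        }
        return out;
    }

    int main() {
        const int rows = 4, cols = 8;
        std::vector<int8_t> W(rows * cols, 1);  // toy weights
        std::vector<int8_t> x(cols, 2);         // toy activations
        for (int32_t y : MatrixVectorUnit(W, x, rows, cols))
            std::cout << y << ' ';              // expect "16" four times
        std::cout << '\n';
    }

In the generated hardware, the two inner loops would be unrolled in space (PE parallel accumulators, each consuming SIMD inputs per cycle), while the two outer loops correspond to the time-multiplexed folding that lets a small processing array cover an arbitrarily large weight matrix; the paper's comparison concerns how efficiently HLS versus hand-written RTL realizes exactly this kind of kernel.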

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Software


Cited by 4 articles.

1. Towards Deploying Highly Quantized Neural Networks on FPGA Using Chisel;2023 26th Euromicro Conference on Digital System Design (DSD);2023-09-06

2. A critical review on the state-of-the-art and future prospects of machine learning for Earth observation operations;Advances in Space Research;2023-06

3. Development an efficient AXI-interconnect unit between set of customized peripheral devices and an implemented dual-core RISC-V processor;The Journal of Supercomputing;2023-05-05

4. A Configurable Mixed-Precision Convolution Processing Unit Generator in Chisel;2023 26th International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS);2023-05-03
