Affiliation:
1. Department of ECE, National Institute of Technology Warangal, Telangana, India
Abstract
Convolutional neural networks (CNNs) are now widely used in deep learning and computer vision applications. Their convolutional layers account for most of the computation and must be evaluated quickly on local edge devices. Field-programmable gate arrays (FPGAs) have been extensively explored as promising hardware accelerators for CNNs owing to their high performance, energy efficiency, and reconfigurability. This paper develops an efficient FPGA-based 16-bit fixed-point hardware accelerator unit for deep learning applications on a 32-bit low-memory edge device (the PYNQ-Z2 board). Additionally, singular value decomposition is applied to the fully connected layer to reduce the dimensionality of its weight parameters. The accelerator unit was designed for all five layers and employs eight processing elements in convolution layers 1 and 2 for parallel computation. Array partitioning, loop unrolling, and pipelining are used to further speed up the calculations, and the AXI-Lite interface handles communication between the IP and other blocks. The design is tested with grayscale image classification on the MNIST handwritten-digit dataset and color image classification on the Tumor dataset. Experimental results show that the proposed accelerator unit outperforms software-based implementations: its inference is 89.03% faster than an Intel 3-core CPU, 86.12% faster than a Haswell 2-core CPU, and 82.45% faster than an NVIDIA Tesla K80 GPU. Furthermore, the proposed design achieves a throughput of 4.33 GOP/s, which is better than conventional CNN accelerator architectures.
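The abstract's SVD-based dimensionality reduction of fully connected (FC) weights can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the paper's implementation: the layer shape (120 × 400) and the rank k = 20 are hypothetical, chosen only to show how a truncated SVD cuts the parameter count.

```python
import numpy as np

# Hypothetical FC-layer weight matrix (outputs x inputs); the paper's actual
# layer sizes are not given in the abstract, so these shapes are illustrative.
rng = np.random.default_rng(0)
W = rng.standard_normal((120, 400)).astype(np.float32)

# Truncated SVD: keep the top-k singular values, so W ~ A @ B
k = 20
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * S[:k]   # shape (120, k): left factor absorbs singular values
B = Vt[:k, :]          # shape (k, 400): right factor

# Parameter count drops from 120*400 = 48000 to 120*k + k*400 = 10400
orig_params = W.size
svd_params = A.size + B.size

# The FC layer x -> W @ x becomes two smaller matrix-vector products
x = rng.standard_normal(400).astype(np.float32)
y_approx = A @ (B @ x)
```

On hardware, the single large matrix-vector product is thus replaced by two much smaller ones, which reduces both weight storage and multiply-accumulate count, at the cost of an approximation error controlled by k.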
Cited by
1 article.