Affiliation:
1. Guangdong University of Technology, China
2. Zhejiang University, China
3. Hong Kong University of Science and Technology, Hong Kong
Abstract
Deep convolutional neural networks (DNNs) have been widely used in many applications, particularly in machine vision. Accelerating DNNs on embedded systems is challenging because real-world machine vision applications must reserve a large share of external memory bandwidth for other tasks, such as video capture and display, leaving little bandwidth for the DNN accelerator. To address this issue, in this study we propose a high-throughput accelerator for bandwidth-limited systems, called the reconfigurable tiny neural-network accelerator (ReTiNNA), and present a real-time object detection system for high-resolution video. We first present a dedicated computation engine that applies different data mapping methods to different filter types, improving data reuse and reducing hardware resources. We then propose an adaptive layer-wise tiling strategy that tiles feature maps into strips, dramatically reducing the control complexity of data transmission and improving transmission efficiency. Finally, we present a design space exploration (DSE) approach that explores the design space more accurately under insufficient bandwidth, improving the performance of the low-bandwidth accelerator. With a low bandwidth of 2.23 GB/s and low hardware consumption of 90.261K LUTs and 448 DSPs, ReTiNNA still achieves 155.86 GOPS on VGG16 and 68.20 GOPS on ResNet50, outperforming other state-of-the-art designs implemented on FPGA devices. Furthermore, the real-time object detection system achieves a detection speed of 19 fps on high-resolution video.
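The strip-wise tiling idea in the abstract can be sketched briefly: for each layer, pick the largest number of feature-map rows whose full-width, all-channel strip fits in the on-chip buffer, then stream the layer strip by strip so DRAM transfers stay long and sequential. The Python sketch below is only an illustration of that general idea; the buffer size, layer shapes, and function names are assumptions, not the paper's actual design.

    # Hypothetical sketch of strip-wise feature-map tiling (assumed parameters,
    # not ReTiNNA's actual scheduler).

    def strip_height(fmap_h, fmap_w, channels, bytes_per_elem, buffer_bytes):
        """Largest number of feature-map rows whose full-width, all-channel
        strip fits in the on-chip buffer."""
        row_bytes = fmap_w * channels * bytes_per_elem
        return max(1, min(fmap_h, buffer_bytes // row_bytes))

    def strip_schedule(layers, bytes_per_elem=1, buffer_bytes=256 * 1024):
        """Per-layer strip schedule: each layer is split into horizontal strips
        so each DMA transfer is a long contiguous burst (low control complexity)."""
        schedule = []
        for name, (h, w, c) in layers.items():
            rows = strip_height(h, w, c, bytes_per_elem, buffer_bytes)
            strips = [(r, min(r + rows, h)) for r in range(0, h, rows)]
            schedule.append((name, strips))
        return schedule

    if __name__ == "__main__":
        # Example VGG16-like layer shapes (H, W, C) at 224x224 input -- assumed values.
        layers = {"conv1_1": (224, 224, 64), "conv3_1": (56, 56, 256)}
        for name, strips in strip_schedule(layers):
            print(name, "->", len(strips), "strips, first:", strips[:3])

Because the strip height is recomputed per layer from that layer's shape, the schedule adapts as feature maps shrink and channel counts grow through the network, which is the sense in which the tiling is "layer-wise" and "adaptive" in the abstract.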
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Software
Cited by
4 articles.