Affiliation:
1. National University of Defense Technology, Changsha, Hunan, China
Abstract
Deep convolutional neural networks (CNNs) have achieved great success in various computer vision applications. State-of-the-art CNN models for large-scale applications are computation and memory intensive and are therefore mainly processed on high-performance processors such as server CPUs and GPUs. However, there is an increasing demand for high-accuracy or real-time object detection in large-scale clusters and embedded systems, which calls for energy-efficient accelerators, whether to satisfy green-computing requirements or to operate within limited battery budgets. Owing to their energy efficiency and reconfigurability, Field-Programmable Gate Arrays (FPGAs) have been widely explored as CNN accelerators. In this article, we present an in-depth analysis of the computational complexity and memory footprint of each CNN layer type. We then propose a scalable parallel framework that exploits four levels of parallelism in hardware acceleration. We further put forward a systematic design space exploration methodology to search for the optimal solution that maximizes accelerator throughput under FPGA constraints such as on-chip memory, computational resources, external memory bandwidth, and clock frequency. Finally, we demonstrate the methodology by optimizing three representative CNNs (LeNet, AlexNet, and VGG-S) on a Xilinx VC709 board. The three accelerators achieve average performance of 424.7, 445.6, and 473.4 GOP/s, respectively, at a 100 MHz working frequency, significantly outperforming the CPU and previous work.
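To illustrate the kind of design space exploration the abstract describes, below is a minimal sketch in Python, not taken from the paper: it exhaustively enumerates candidate parallelism factors and keeps the feasible configuration with the highest estimated throughput. The resource and throughput models (resources, throughput_gops) and all budget constants are illustrative assumptions, not the authors' cost model or the VC709's exact figures.

```python
# Hypothetical sketch of a design-space-exploration loop: enumerate candidate
# parallelism factors, discard configurations that violate FPGA resource
# limits, and keep the one with the highest estimated throughput.
# All models and constants below are illustrative assumptions.
from itertools import product

# Assumed FPGA budgets (illustrative, not the VC709's exact figures)
DSP_BUDGET = 3600          # DSP slices available
BRAM_BUDGET_KB = 6000      # on-chip memory in KB
BW_BUDGET_GBS = 10.0       # external memory bandwidth in GB/s
FREQ_HZ = 100e6            # 100 MHz working frequency

def resources(p_in, p_out, p_x, p_y):
    """Hypothetical resource model: one MAC unit per parallel lane."""
    lanes = p_in * p_out * p_x * p_y
    dsp = lanes                      # one DSP slice per MAC (assumption)
    bram_kb = 4 * (p_in + p_out)     # on-chip buffer estimate (assumption)
    bw_gbs = 0.002 * lanes           # off-chip bandwidth estimate (assumption)
    return dsp, bram_kb, bw_gbs

def throughput_gops(p_in, p_out, p_x, p_y):
    """2 ops (multiply + accumulate) per lane per cycle."""
    return 2 * p_in * p_out * p_x * p_y * FREQ_HZ / 1e9

best = None
# Four levels of parallelism: input channels, output channels, and the
# two spatial dimensions of the output feature map (assumed mapping).
for p_in, p_out, p_x, p_y in product([1, 2, 4, 8, 16, 32], repeat=4):
    dsp, bram_kb, bw_gbs = resources(p_in, p_out, p_x, p_y)
    if dsp > DSP_BUDGET or bram_kb > BRAM_BUDGET_KB or bw_gbs > BW_BUDGET_GBS:
        continue  # violates an FPGA constraint
    gops = throughput_gops(p_in, p_out, p_x, p_y)
    if best is None or gops > best[0]:
        best = (gops, (p_in, p_out, p_x, p_y))

print(f"best config {best[1]} -> {best[0]:.1f} GOP/s (estimated)")
```

A real exploration flow would substitute measured or analytically derived per-layer cost models for these placeholder formulas; the exhaustive-search structure, however, conveys how throughput can be maximized subject to on-chip memory, compute, and bandwidth constraints.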
Funder
National Natural Science Foundation of China
Publisher
Association for Computing Machinery (ACM)
Cited by
75 articles.