Affiliation:
1. State Key Laboratory on Integrated Optoelectronics, College of Electronic Science & Engineering, Jilin University, Changchun 130012, China
2. The Department of Quantum & Computer Engineering, Delft University of Technology, 2628 CD Delft, The Netherlands
3. School of Microelectronics, Dalian University of Technology, Dalian 116620, China
Abstract
Convolutional neural networks (CNNs) have proven to be effective in many application domains, especially in computer vision. To achieve lower-latency CNN processing and reduce power consumption, developers are experimenting with FPGAs to accelerate CNNs in several applications. Current FPGA CNN accelerators usually adopt the same acceleration approach as GPUs, where operations from different network layers are mapped to the same hardware units working in a multiplexed manner. This yields high flexibility in implementing different types of CNNs, but it degrades the latency the accelerator can achieve. Alternatively, latency can be reduced by pipelining the processing of consecutive layers, at the expense of more FPGA resources. The continued increase in hardware resources available in FPGAs makes such implementations feasible for latency-critical application domains. In this paper, we present FPQNet, a fully pipelined and quantized CNN FPGA implementation that is channel-parallel, layer-pipelined, and network-parallel to decrease latency and increase throughput, combined with quantization methods to optimize hardware utilization. In addition, we tailor the hardware architecture to the HDMI timing standard to avoid extra hardware utilization, enabling the accelerator to handle video datasets. We present prototypes of the FPQNet CNN implementations on an Alpha Data 9H7 FPGA, connected via an OpenCAPI interface, to demonstrate the capabilities of the architecture. Results show that, with a 250 MHz clock frequency, an optimized LeNet-5 design achieves a latency as low as 9.32 µs with an accuracy of 98.8% on the MNIST dataset, making it feasible for high-frame-rate video processing applications. With 10 hardware kernels working concurrently, the throughput reaches 1108 GOPS. The methods in this paper are applicable to many other CNNs. Our analysis shows that, using the proposed architecture, the latencies of AlexNet, ZFNet, OverFeat-Fast, and OverFeat-Accurate can be as low as 69.27, 66.95, 182.98, and 132.6 µs, respectively.
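As a quick sanity check of the headline figures, the short Python sketch below (a hypothetical helper script, not part of the paper's artifacts) derives the implied per-frame cycle count, a single-kernel frame-rate upper bound, and per-kernel throughput directly from the numbers quoted in the abstract (9.32 µs latency, 250 MHz clock, 10 kernels, 1108 GOPS).

```python
# Back-of-envelope check of the figures quoted in the abstract.
# The constants below come from the abstract; everything else is assumption.

LATENCY_S = 9.32e-6    # per-frame latency of the optimized LeNet-5 design
CLOCK_HZ = 250e6       # reported clock frequency
NUM_KERNELS = 10       # concurrent hardware kernels
TOTAL_GOPS = 1108      # reported aggregate throughput (giga-operations/s)

# Upper bound on frame rate for one pipelined kernel if a new frame is
# accepted every latency interval (ignores I/O and framing overhead).
max_fps_single = 1.0 / LATENCY_S
print(f"frames/s (single kernel, upper bound): {max_fps_single:,.0f}")

# Clock cycles spent per frame at the reported frequency.
cycles_per_frame = LATENCY_S * CLOCK_HZ
print(f"clock cycles per frame: {cycles_per_frame:,.0f}")

# Average throughput contributed by each of the concurrent kernels.
gops_per_kernel = TOTAL_GOPS / NUM_KERNELS
print(f"throughput per kernel: {gops_per_kernel:.1f} GOPS")
```

Under these assumptions, a single kernel could in principle sustain on the order of 10^5 frames per second, which is consistent with the claim that the design suits high-frame-rate video processing.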
Funder
Innovation Team Support Plan of Dalian
National Natural Science Foundation of China
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
Cited by
1 article.