An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism

Author:

Ziaul Choudhury¹, Shashwat Shrivastava¹, Lavanya Ramapantulu¹, Suresh Purini¹

Affiliation:

1. International Institute of Information Technology, Hyderabad, Telangana, India

Abstract

Increasingly, pre-trained convolutional neural networks (CNNs) are being deployed for inference in various computer vision applications, both on the server side in data centers and at the edge. CNN inference is a very compute-intensive task, and it is a challenge to meet performance metrics such as latency and throughput while optimizing power. Special-purpose ASICs and FPGAs are suitable candidates for meeting these power and performance budgets simultaneously. Rapidly evolving CNN architectures involve novel convolution operations such as pointwise convolutions, depthwise separable convolutions, and so on, leading to substantial variation in computational structure across CNNs and across layers within a CNN. Because of this variability, FPGA reconfigurability offers an attractive tradeoff compared to ASICs. FPGA-based hardware designers address the structural variability by generating a network-specific accelerator for a single network or a class of networks. Homogeneous accelerators, in contrast, are network-agnostic but often sacrifice throughput and FPGA LUTs for flexibility. In this article, we propose an FPGA overlay for efficient processing of CNNs that can be scaled to the available compute and memory resources of the FPGA. The overlay is configured on the fly through control words sent by the host on a per-layer basis. Unlike current overlays, our architecture exploits all forms of parallelism inside a convolution operation. A constraint system at the host determines the per-layer configuration of the overlay that exploits all these forms of parallelism, yielding the highest throughput for that layer. We studied the effectiveness of our overlay by using it to process the AlexNet, VGG16, YOLO, MobileNet, and ResNet-50 CNNs, targeting a Virtex-7 FPGA and a larger UltraScale+ VU9P FPGA. The chosen CNNs mix different types of convolution layers and filter sizes, presenting good variation in model size and structure. Our accelerator achieved a maximum throughput of 1,200 GOps/s on the Virtex-7, an improvement of 1.2× to 5× over recent designs. The reported performance density, measured in giga operations per second per KLUT, is a 1.3× to 4× improvement over existing works. Similar speedups and performance densities are observed on the UltraScale+ VU9P FPGA.
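To make the host-side "constraint system" concrete, the sketch below brute-forces a per-layer parallelism configuration in the spirit the abstract describes. This is an illustration only, not the authors' formulation: the four parallelism dimensions (Po across output channels, Pi across input channels, Pk across the filter window, Pp across output pixels), the candidate factor sets, and the 2048-DSP budget are all assumptions made for the example.

```python
# Hypothetical sketch of a per-layer parallelism search (not the paper's code).
from dataclasses import dataclass
from itertools import product
import math

@dataclass
class Layer:
    in_ch: int       # number of input channels
    out_ch: int      # number of output channels
    k: int           # filter is k x k
    out_pixels: int  # H_out * W_out of the output feature map

def best_config(layer: Layer, dsp_budget: int = 2048):
    """Return the (Po, Pi, Pk, Pp) factors that minimize cycles for one layer."""
    best, best_cycles = None, math.inf
    po_opts = [p for p in (1, 2, 4, 8, 16, 32, 64) if p <= layer.out_ch]
    pi_opts = [p for p in (1, 2, 4, 8, 16, 32) if p <= layer.in_ch]
    pk_opts = sorted({1, layer.k, layer.k * layer.k})
    pp_opts = (1, 2, 4, 8)
    for po, pi, pk, pp in product(po_opts, pi_opts, pk_opts, pp_opts):
        if po * pi * pk * pp > dsp_budget:  # resource constraint: one DSP per MAC
            continue
        # Ceiling at every dimension models the idle cycles incurred when a
        # layer dimension is not a multiple of its parallelism factor.
        cycles = (math.ceil(layer.out_ch / po)
                  * math.ceil(layer.in_ch / pi)
                  * math.ceil(layer.k * layer.k / pk)
                  * math.ceil(layer.out_pixels / pp))
        if cycles < best_cycles:
            best, best_cycles = (po, pi, pk, pp), cycles
    return best, best_cycles

# Example: a VGG16-style 3x3 layer, 256 -> 256 channels, 28x28 output map.
cfg, cycles = best_config(Layer(in_ch=256, out_ch=256, k=3, out_pixels=28 * 28))
print("parallelism (Po, Pi, Pk, Pp):", cfg, "cycles:", cycles)
```

In an overlay of the kind described, the chosen factors would presumably be encoded in the per-layer control word that the host streams to the device before that layer is processed; a real system would also constrain on-chip buffer (BRAM) capacity and off-chip bandwidth, not just DSPs.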

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Cited by 6 articles.

1. An Architectural Template for FPGA Overlays Targeting Data Flow Applications. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2024-05-27.

2. Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs. ACM Transactions on Architecture and Code Optimization, 2024-03-23.

3. An Approach Towards Distributed DNN Training on FPGA Clusters. Lecture Notes in Computer Science, 2024.

4. FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler. ACM Transactions on Architecture and Code Optimization, 2023-12-14.

5. Smart-DNN+: A Memory-efficient Neural Networks Compression Framework for the Model Inference. ACM Transactions on Architecture and Code Optimization, 2023-10-26.
