High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design

Published:2024-01-15 Issue:1 Volume:17 Page:1-20
ISSN:1936-7406
Container-title:ACM Transactions on Reconfigurable Technology and Systems
language:en
Short-container-title:ACM Trans. Reconfigurable Technol. Syst.

Author:

Anupreetham Anupreetham¹, Ibrahim Mohamed², Hall Mathew³, Boutros Andrew⁴, Kuzhively Ajay¹, Mohanty Abinash¹, Nurvitadhi Eriko⁵, Betz Vaughn⁴, Cao Yu¹, Seo Jae-Sun¹

Affiliation:

1. Arizona State University, USA

2. University of Toronto, Intel Corporation, Canada

3. University of Toronto, Canada

4. University of Toronto, Vector Institute for AI, Canada

5. Intel Corporation, USA

Abstract

Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs) for feature extraction, coupled with single-shot multi-box detectors (SSDs) that generate bounding boxes around the identified objects along with their classification confidence scores. Subsequently, a non-maximum suppression (NMS) module removes any redundant detection boxes from the final output. Typical NMS algorithms must wait for all box predictions to be generated by the SSD-based feature extractor before processing them. This sequential dependency between box predictions and NMS results in a significant latency overhead and degrades the overall system throughput, even if a high-performance CNN accelerator is used for the SSD feature extraction component. In this paper, we present a novel pipelined NMS algorithm that eliminates this sequential dependency and associated NMS latency overhead. We then use our novel NMS algorithm to implement an end-to-end fully pipelined FPGA system for low-latency SSD-MobileNet-V1 object detection. Our system, implemented on an Intel Stratix 10 FPGA, runs at 400 MHz and achieves a throughput of 2,167 frames per second with an end-to-end batch-1 latency of 2.13 ms. Our system achieves 5.3× higher throughput and 5× lower latency compared to the best prior FPGA-based solution with comparable accuracy.
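For readers unfamiliar with the sequential dependency the abstract describes, the following is a minimal sketch of conventional greedy NMS in Python. It is not the pipelined algorithm proposed in the paper, which this record does not detail. The function names `iou` and `greedy_nms`, the corner-coordinate box layout, and the 0.5 IoU threshold are illustrative assumptions. Note that the sort over all scores cannot begin until every box prediction is available, which is the wait the abstract refers to.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept by conventional greedy NMS.

    The sort below needs every score up front, so this step cannot
    start until all predictions have been generated.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that was already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical example: two heavily overlapping boxes and one separate box.
example_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
example_scores = [0.9, 0.8, 0.7]
kept = greedy_nms(example_boxes, example_scores)
```

In this hypothetical example the second box overlaps the first with an IoU of about 0.68, so it is suppressed and only the first and third boxes survive.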

Funder

NSF

Intel ISRA program on FPGA

Intel/VMware Crossroads 3D-FPGA Academic Research Center

Intel/NSERC Industrial Research Chair in Programmable Silicon

Vector Institute for Artificial Intelligence

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3634919

Reference: 36 articles.

1. Mohamed S. Abdelfattah, David Han, Andrew Bitar, Roberto DiCecco, Shane O’Connell, Nitika Shanker, Joseph Chu, Ian Prins, Joshua Fender, Andrew C. Ling, and Gordon R. Chiu. 2018. DLA: Compiler and FPGA overlay for neural network inference acceleration. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL). 411–4117. 10.1109/FPL.2018.00077

2. End-to-End FPGA-based Object Detection Using Pipelined CNN and Non-Maximum Suppression

3. FPGA Architecture: Principles and Progression

4. Beyond Peak Performance: Comparing the Real Performance of AI-Optimized FPGAs and GPUs

5. You Cannot Improve What You Do not Measure

Cited by 1 article.

1. SMOF: Streaming Modern CNNs on FPGAs with Smart Off-Chip Eviction;2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM);2024-05-05