Deep Learning Inferencing with High-performance Hardware Accelerators

Authors:

Luke Kljucaric¹ (ORCID), Alan D. George¹ (ORCID)

Affiliation:

1. NSF SHREC Center, ECE Dept., University of Pittsburgh, USA

Abstract

As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum app acceleration. The goal of benchmarking suites, such as MLPerf for analyzing machine learning (ML) hardware performance, is to standardize a fair comparison of different hardware architectures. However, many apps that are not well represented by these standards require different workloads, such as other ML models and datasets, to achieve similar goals. Additionally, many apps, like real-time video processing, focus on the latency of computations rather than strictly on throughput. This research analyzes multiple compute architectures featuring ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of their streaming latency and maximum throughput for optical character recognition. Since these models are composed of fundamental neural network operations yet are architecturally different from each other, they stress devices in distinct and insightful ways, from which generalizations about the performance of other models can be drawn. Many devices featuring ML-specific hardware and optimizations are analyzed, including Intel and AMD CPUs, Xilinx and Intel FPGAs, NVIDIA GPUs, and Google TPUs. Overall, the ML-oriented hardware added to Intel Xeon CPUs boosts throughput by 3.7× and reduces latency by up to 34.7×, which makes the latency of Intel Xeon CPUs competitive on more parallel models. The TPU devices were limited in throughput by large data-transfer times and were not competitive in latency. The FPGA frameworks showcase the lowest latency, with the Xilinx Alveo U200 FPGA achieving 0.48 ms on AlexNet using Mipsology Zebra and 0.39 ms on GoogLeNet using Vitis-AI. Through their custom acceleration datapaths coupled with high-performance SRAM, the FPGAs keep critical model data close to the processing elements for lower latency. The massively parallel, high-memory GPU devices with Tensor Core accelerators achieve the best throughput: the NVIDIA Tesla A100 GPU reaches 42,513 and 52,484 images/second for AlexNet and GoogLeNet, respectively.
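The two metrics studied here, single-image streaming latency and large-batch maximum throughput, can be reproduced in spirit with a simple timing harness. Below is a minimal PyTorch sketch of such a harness; it is illustrative only and is not the benchmark code used in this work. Stock torchvision models stand in for the paper's custom GoogLeNet, random tensors stand in for the handwritten-character dataset, and the batch size and iteration counts are arbitrary assumptions.

import time
import torch
import torchvision.models as models

# Device and model under test (stand-ins for the paper's setup).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.alexnet(weights=None).eval().to(device)

@torch.no_grad()
def streaming_latency_ms(model, iters=100):
    # Median time to classify one image at batch size 1.
    x = torch.randn(1, 3, 224, 224, device=device)
    for _ in range(10):          # warm-up runs, excluded from timing
        model(x)
    samples = []
    for _ in range(iters):
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        samples.append((time.perf_counter() - start) * 1e3)
    return sorted(samples)[len(samples) // 2]

@torch.no_grad()
def max_throughput_ips(model, batch=256, iters=20):
    # Images per second when the device is saturated with a large batch.
    x = torch.randn(batch, 3, 224, 224, device=device)
    model(x)                     # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return batch * iters / (time.perf_counter() - start)

print(f"streaming latency: {streaming_latency_ms(model):.2f} ms")
print(f"max throughput: {max_throughput_ips(model):,.0f} images/s")

The synchronize calls matter on GPUs: CUDA kernel launches are asynchronous, so timing without synchronization would measure launch overhead rather than inference time. The FPGA and TPU results in the abstract come from vendor-specific toolchains (Vitis-AI, Mipsology Zebra, Cloud TPU) and are not reproduced by this sketch.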

Funder

SHREC industry and agency members and the IUCRC Program of the National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

Artificial Intelligence, Theoretical Computer Science


Cited by 8 articles.
