gem5-NVDLA: A Simulation Framework for Compiling, Scheduling, and Architecture Evaluation on AI System-on-Chips-Reference-Cited by-同舟云学术

gem5-NVDLA: A Simulation Framework for Compiling, Scheduling, and Architecture Evaluation on AI System-on-Chips

Published:2024-09-04 Issue:5 Volume:29 Page:1-20
ISSN:1084-4309
Container-title:ACM Transactions on Design Automation of Electronic Systems
language:en
Short-container-title:ACM Trans. Des. Autom. Electron. Syst.

Author:

Lai Chengtao¹^ORCID,Zhang Wei¹^ORCID

Affiliation:

1. Department of Electronic & Computer Engineering, The Hong Kong University of Science and Technology, Kowloon, Hong Kong

Abstract

Recent years have seen an increasing trend in designing AI accelerators together with the rest of the system, including CPUs and memory hierarchy. This trend calls for high-quality simulators or analytical models that enable such kind of co-exploration. Currently, the majority of such exploration is supported by AI accelerator analytical models. But such models usually overlook the non-trivial impact of congestion of shared resources, non-ideal hardware utilization and non-zero CPU scheduler overhead, which could only be modeled by cycle-level simulators. However, most simulators with full-stack toolchains are proprietary to corporations, and the few open-source simulators are suffering from either weak compilers or limited space of modeling. This framework resolves these issues by proposing a compilation and simulation flow to run arbitrary Caffe neural network models on the NVIDIA Deep Learning Accelerator (NVDLA) with gem5, a cycle-level simulator, and by adding more building blocks including scratchpad allocation, multi-accelerator scheduling, tensor-level prefetching mechanisms, and a DMA-aided embedded buffer to map workload to multiple NVDLAs. The proposed framework has been tested and verified on a set of convolution neural networks, showcasing the capability of modeling complex buffer management strategies, scheduling policies, and hardware architectures. As a case study of this framework, we demonstrate the importance of adopting different buffering strategies for activation and weight tensors in AI accelerators to acquire remarkable speedup.

Funder

Hong Kong RGC GRF

AI Chip Center for Emerging Smart Systems

Huawei Hong Kong Research Center

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3661997

Reference40 articles.

1. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for Large-Scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 265–283.

2. Chipyard: Integrated Design, Simulation, and Implementation Framework for Custom SoCs

3. The gem5 simulator

4. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks

5. Coral. 2023. Edge TPU Compiler. Retrieved from: https://coral.ai/docs/edgetpu/compiler/#parameter-data-caching