ARM-CO-UP: ARM COoperative Utilization of Processors

Published: 2024-09-04 Issue: 5 Volume: 29 Page: 1-30
ISSN: 1084-4309
Container-title: ACM Transactions on Design Automation of Electronic Systems
Language: en
Short-container-title: ACM Trans. Des. Autom. Electron. Syst.

Author:

Aghapour Ehsan¹, Sapra Dolly¹, Pimentel Andy¹, Pathania Anuj¹

Affiliation:

1. University of Amsterdam, Amsterdam, Netherlands

Abstract

Heterogeneous Multi-Processor Systems-on-Chips (HMPSoCs) combine different processors on a single chip. They enable powerful embedded devices, which increasingly perform machine learning (ML) inference tasks at the edge. State-of-the-art HMPSoCs can perform on-chip embedded inference using different processors, such as CPUs, GPUs, and NPUs. By using multiple processors cooperatively, HMPSoCs can potentially overcome the low performance and efficiency of single-processor CNN inference. However, standard inference frameworks for edge devices typically utilize only a single processor. We present the ARM-CO-UP framework, built on the ARM-CL library. The framework supports two modes of operation: Pipeline and Switch. In Pipeline mode, it optimizes inference throughput through pipelined execution of network partitions for consecutive input frames. In Switch mode, it improves inference latency through layer-switched inference for a single input frame. Furthermore, it supports layer-wise CPU/GPU DVFS in both modes to improve power efficiency and reduce energy consumption. ARM-CO-UP is a comprehensive framework for multi-processor CNN inference that automates CNN partitioning and mapping, pipeline synchronization, processor type switching, layer-wise DVFS, and closed-source NPU integration.
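The difference between the two modes can be illustrated with a toy timing model. The sketch below is not the ARM-CO-UP API and uses no ARM-CL calls; the per-layer latencies, the `switch_cost_ms` overhead, and both function names are hypothetical, chosen only to show why pipelining targets throughput while layer switching targets single-frame latency. It also assumes that data transfer between pipeline stages is free, which a real system would not guarantee.

```python
from typing import Dict, List, Tuple

# Hypothetical per-layer latencies in milliseconds for a 4-layer CNN.
# These numbers are illustrative assumptions, not measurements.
LAYER_MS: Dict[str, List[float]] = {
    "cpu": [4.0, 6.0, 8.0, 2.0],
    "gpu": [5.0, 3.0, 4.0, 3.0],
}


def pipeline_frame_interval_ms(stages: List[Tuple[str, int, int]]) -> float:
    """Steady-state time between finished frames in a pipeline.

    Each stage is (processor, first_layer, end_layer_exclusive). Stages run
    concurrently on consecutive frames, so the slowest stage sets the rate.
    Inter-stage transfer cost is assumed to be zero in this sketch.
    """
    stage_times = [sum(LAYER_MS[proc][a:b]) for proc, a, b in stages]
    return max(stage_times)


def switch_latency_ms(mapping: List[str], switch_cost_ms: float = 0.5) -> float:
    """Latency of one frame when each layer runs on its mapped processor.

    A fixed (hypothetical) overhead is paid whenever consecutive layers
    run on different processors.
    """
    total = 0.0
    for i, proc in enumerate(mapping):
        total += LAYER_MS[proc][i]
        if i > 0 and mapping[i - 1] != proc:
            total += switch_cost_ms
    return total


if __name__ == "__main__":
    gpu_only = switch_latency_ms(["gpu"] * 4)
    piped = pipeline_frame_interval_ms([("cpu", 0, 1), ("gpu", 1, 4)])
    switched = switch_latency_ms(["cpu", "gpu", "gpu", "cpu"])
    print(f"GPU only: {gpu_only} ms per frame")
    print(f"Pipeline: one frame every {piped} ms")
    print(f"Switch:   {switched} ms latency for one frame")
```

With these assumed numbers, the two-stage pipeline completes a frame every 10.0 ms against 15.0 ms for the GPU alone, and the layer-switched mapping finishes a single frame in 14.0 ms. Choosing the partition points and the per-layer mapping automatically is the part the abstract attributes to the framework; this sketch only evaluates a mapping that is already given.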

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3656472

References: 30 articles.

1. CPU-GPU Layer-Switched Low Latency CNN Inference

2. PELSI: Power-Efficient Layer-Switched Inference

3. PipeBERT: High-throughput BERT Inference for ARM Big.LITTLE Multi-core Processors

4. Review of Low Frame Rate Effects on Human Performance

5. Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, et al. 2018. TVM: An automated End-to-End optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), 578–594.

Cited by 1 article.

1. PiQi: Partially Quantized DNN Inference on HMPSoCs; Proceedings of the 29th ACM/IEEE International Symposium on Low Power Electronics and Design; 2024-08-05