Automated Backend Allocation for Multi-Model, On-Device AI Inference

Authors:

Venkatraman Iyer1, Sungho Lee1, Semun Lee1, Juitem Joonwoo Kim1, Hyunjun Kim1, Youngjae Shin1

Affiliation:

1. Samsung Electronics, Seoul, South Korea

Abstract

On-device Artificial Intelligence (AI) services such as face recognition, object tracking, and voice recognition are rapidly scaling up deployments on embedded, memory-constrained hardware. These services typically delegate AI inference models for execution on CPU and GPU computing backends. While GPU delegation is common practice for achieving high-speed computation, it suffers from degraded throughput and completion times in multi-model scenarios, i.e., when multiple services execute concurrently. This paper introduces a solution that sustains performance in multi-model, on-device AI contexts by dynamically allocating a combination of CPU and GPU backends per model. The allocation is feedback-driven, guided by knowledge of model-specific, multi-objective Pareto fronts comprising inference latency and memory consumption. The primary contribution of this paper is a backend allocation algorithm that runs online per model and achieves a 25-100% improvement in throughput over both static allocations and load-balancing scheduler solutions targeting multi-model scenarios. Other noteworthy contributions include a novel Pareto front estimator that runs on-device, and a software-based GPU profiler with a lightweight algorithm to detect changing GPU workloads. Specifically, the Pareto front estimator outperforms the state-of-the-art algorithms NSGA-II and SPEA2 by 94% on Pareto coverage, and by almost 2x on computational overhead.
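To make the feedback-driven allocation idea concrete, the following minimal Python sketch shows one way a per-model Pareto front over (inference latency, memory consumption) measurements could guide the choice of a CPU/GPU backend configuration under a memory budget. This is an illustration of the concept only, not the paper's actual algorithm or code; all names (BackendConfig, dominates, pareto_front, choose_backend) and the sample numbers are hypothetical.

# Illustrative sketch only (Python 3.10+): NOT the paper's algorithm.
# Keeps a per-model Pareto front of (latency, memory) points measured for
# candidate CPU/GPU backend splits, then picks a non-dominated configuration
# that fits the memory budget currently left by concurrently running models.
from dataclasses import dataclass

@dataclass(frozen=True)
class BackendConfig:
    cpu_threads: int    # hypothetical: CPU worker threads assigned to the model
    use_gpu: bool       # hypothetical: whether part of the model runs on the GPU
    latency_ms: float   # measured inference latency for this configuration
    memory_mb: float    # measured memory footprint for this configuration

def dominates(a: BackendConfig, b: BackendConfig) -> bool:
    """a dominates b if it is no worse on both objectives and better on one."""
    return (a.latency_ms <= b.latency_ms and a.memory_mb <= b.memory_mb
            and (a.latency_ms < b.latency_ms or a.memory_mb < b.memory_mb))

def pareto_front(configs: list[BackendConfig]) -> list[BackendConfig]:
    """Keep only the non-dominated (latency, memory) points."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs if other is not c)]

def choose_backend(configs: list[BackendConfig],
                   memory_budget_mb: float) -> BackendConfig | None:
    """Pick the lowest-latency Pareto-optimal configuration within the budget."""
    feasible = [c for c in pareto_front(configs) if c.memory_mb <= memory_budget_mb]
    return min(feasible, key=lambda c: c.latency_ms) if feasible else None

# Hypothetical profiled configurations for one model; the budget shrinks as
# other models start running, which triggers re-selection (the feedback loop).
measured = [
    BackendConfig(4, True, 12.0, 310.0),
    BackendConfig(2, True, 15.0, 260.0),
    BackendConfig(4, False, 28.0, 140.0),
    BackendConfig(1, False, 55.0, 90.0),
]
print(choose_backend(measured, memory_budget_mb=200.0))

In a running system, the measured points would be refreshed online as concurrent models change the available budget and GPU load, which is the kind of feedback-driven, per-model re-allocation the abstract describes.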

Publisher

Association for Computing Machinery (ACM)

Subjects

Computer Networks and Communications; Hardware and Architecture; Safety, Risk, Reliability and Quality; Computer Science (miscellaneous)


Cited by 2 articles.

1. Automated Backend Allocation for Multi-Model, On-Device AI Inference;ACM SIGMETRICS Performance Evaluation Review;2024-06-11

2. Automated Backend Allocation for Multi-Model, On-Device AI Inference;Abstracts of the 2024 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems;2024-06-10
