Baymax

Published:2016-06-09 Issue:4 Volume:51 Page:681-696
ISSN:0362-1340
Container-title:ACM SIGPLAN Notices
language:en
Short-container-title:SIGPLAN Not.

Author:

Chen Quan¹,Yang Hailong²,Mars Jason³,Tang Lingjia³

Affiliation:

1. Shanghai Jiao Tong University, Ann Arbor, MI, USA

2. Beihang University, Ann Arbor, USA

3. University of Michigan, Ann Arbor, USA

Abstract

Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal user access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different than contention on multi-core CPUs and introduces a new set of challenges to reduce QoS violation. To address this open problem, we first identify the underlying causes for QoS violation in accelerator-outfitted servers. Our experiments show that queuing delay for the compute resources and PCI-e bandwidth contention for data transfer are the two main factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an Nvidia K40 GPU, our evaluation shows that Baymax improves the accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution.

Funder

National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design,Software

Link

https://dl.acm.org/doi/pdf/10.1145/2954679.2872368

Reference: 68 articles.

1. Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukás Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlıcek, Yanmin Qian, Petr Schwarz, et al. The Kaldi Speech Recognition Toolkit. 2011.

2. SURF: Speeded Up Robust Features

3. Qualcomm Acquires Kooaba Visual Recognition Company. http://mobilemarketingmagazine.com/qualcomm-acquires-kooaba-visual-recognition-company.

4. Introduction to the CoNLL-2000 shared task

Cited by 91 articles.

1. MIGER: Integrating Multi-Instance GPU and Multi-Process Service for Deep Learning Clusters;Proceedings of the 53rd International Conference on Parallel Processing;2024-08-12

2. BCEdge: SLO-Aware DNN Inference Services With Adaptive Batch-Concurrent Scheduling on Edge Devices;IEEE Transactions on Network and Service Management;2024-08

3. MediatorDNN: Contention Mitigation for Co-Located DNN Inference Jobs;2024 IEEE 17th International Conference on Cloud Computing (CLOUD);2024-07-07

4. HSAS: Efficient task scheduling for large scale heterogeneous systolic array accelerator cluster;Future Generation Computer Systems;2024-05

5. RELIEF: Relieving Memory Pressure In SoCs Via Data Movement-Aware Accelerator Scheduling;2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA);2024-03-02