Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Authors:

Ye Zhisheng¹, Gao Wei², Hu Qinghao², Sun Peng³, Wang Xiaolin¹, Luo Yingwei¹, Zhang Tianwei⁴, Wen Yonggang⁴

Affiliation:

1. Peking University, China

2. S-Lab, Nanyang Technological University, Singapore

3. Shanghai AI Laboratory & SenseTime Research, China

4. Nanyang Technological University, Singapore

Abstract

Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. Developing a DL model is a time-consuming and resource-intensive procedure, so dedicated GPU accelerators are now collectively assembled into GPU datacenters. An efficient scheduler design for a GPU datacenter is crucial for reducing operational cost and improving resource utilization. However, traditional approaches designed for big data or high-performance computing workloads cannot enable DL workloads to fully utilize GPU resources. Recently, many schedulers tailored to DL workloads in GPU datacenters have been proposed. This article surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads in terms of scheduling objectives and resource utilization. Finally, we discuss several promising future research directions, including emerging DL workloads, advanced scheduling decision making, and underlying hardware resources. A more detailed summary of the surveyed papers, along with code links, can be found at our project website: https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers

Funder

National Key R&D Program of China

RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative

National Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science, Theoretical Computer Science

References: 218 articles.

1. Amazon Web Services Labs. 2022. Multi Model Server: A tool for serving neural net models for inference. https://github.com/awslabs/multi-model-server

2. OpenPBS Contributor. 2022. OpenPBS. https://www.openpbs.org/

3. Marcelo Amaral, Jordà Polo, David Carrera, Seetharami Seelam, and Malgorzata Steinder. 2017. Topology-aware GPU scheduling for learning workloads in cloud environments. In SC'17.

4. Marcos D. Assunção. 2015. Big data computing and clouds: Trends and future directions. J. Parallel Distrib. Comput.

5. Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. 2020. PipeSwitch: Fast pipelined context switching for deep learning applications. In OSDI'20.
