Lyra: Elastic Scheduling for Deep Learning Clusters

Published: 2023-05-08
Container-title: Proceedings of the Eighteenth European Conference on Computer Systems

Author:

Li Jiamin¹, Xu Hong², Zhu Yibo³, Liu Zherui⁴, Guo Chuanxiong⁵, Wang Cong¹

Affiliation:

1. City University of Hong Kong, Hong Kong, Hong Kong

2. The Chinese University of Hong Kong, Hong Kong, Hong Kong

3. Google, Kirkland, United States of America

4. ByteDance Inc., Beijing, China

5. Unaffiliated, Bellevue, United States of America

Funder

Research Grants Council of Hong Kong

Chinese University of Hong Kong

Publisher

ACM

Link

https://dl.acm.org/doi/pdf/10.1145/3552326.3587445

References: 60 articles.

1. Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

2. Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. 2020. PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications. In Proc. USENIX OSDI.

3. Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning

4. Semi-dynamic load balancing

5. Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an efficient and scalable deep learning training system. In Proc. USENIX OSDI.

Cited by 8 articles.

1. Application-Oriented Cloud Workload Prediction: A Survey and New Perspectives;Tsinghua Science and Technology;2025-02

2. Latency-Guaranteed Co-Location of Inference and Training for Reducing Data Center Expenses;2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS);2024-07-23

3. Enhanced Scheduling of AI Applications in Multi-Tenant Cloud Using Genetic Optimizations;Applied Sciences;2024-05-29

4. Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters;Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2;2024-04-27

5. Deferred Continuous Batching in Resource-Efficient Large Language Model Serving;Proceedings of the 4th Workshop on Machine Learning and Systems;2024-04-22