Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning

Published: 2023-11-11
Container-title: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Author:

Ding Qiyang¹, Zheng Pengfei², Kudari Shreyas¹, Venkataraman Shivaram², Zhang Zhao³

Affiliation:

1. University of Texas at Austin, Austin, United States of America

2. University of Wisconsin-Madison, Madison, United States of America

3. Texas Advanced Computing Center (TACC), Austin, United States of America

Funder

NSF

Publisher

ACM

Link

https://dl.acm.org/doi/pdf/10.1145/3581784.3607042

References (52 articles)

1. [n. d.]. Clipped Proximal Policy Optimization. https://intellabs.github.io/coach/components/agents/policy_optimization/cppo.html.

2. [n. d.]. Introducing the AI Research SuperCluster --- Meta's cutting-edge AI supercomputer for AI research.

3. 2022. Slurm Simulator. https://github.com/ubccr-slurm-simulator/slurm_simulator.

4. Experience Replay for Real-Time Reinforcement Learning Control

5. Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An Open-Source Autoregressive Language Model. (2022).

Cited by 2 articles.

1. Reference Implementation of Smart Scheduler: A CI-Aware, AI-Driven Scheduling Framework for HPC Workloads. Practice and Experience in Advanced Research Computing 2024: Human Powered Computing. 2024-07-17.

2. Creating intelligent cyberinfrastructure for democratizing AI. AI Magazine. 2024-03.