1. Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning
2. Estimating GPU memory consumption of deep learning models
3. Mangpo Phothilimthana, Sami Abu-El-Haija, Kaidi Cao, Bahare Fatemi, Michael Burrows, Charith Mendis, and Bryan Perozzi. 2024. TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs. Advances in Neural Information Processing Systems 36 (2024).
4. Tapis: An API Platform for Reproducible, Distributed Computational Research
5. Sahil Tyagi and Prateek Sharma. 2023. Scavenger: A Cloud Service for Optimizing Cost and Performance of ML Training. In 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, 403–413.