1. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, Savannah, GA, 265–283.
2. Amazon. 2023. Amazon SageMaker. https://aws.amazon.com/sagemaker.
3. Weights & Biases. 2023. Current Best Practices for Training LLMs from Scratch. https://wandb.ai/site/llm-whitepaper.
4. Scott Boag, Parijat Dube, Benjamin Herta, Waldemar Hummer, Vatche Ishakian, K. Jayaram, Michael Kalantar, Vinod Muthusamy, Priya Nagpurkar, and Florian Rosenberg. 2017. Scalable Multi-Framework Multi-Tenant Lifecycle Management of Deep Learning Training Jobs. In Workshop on ML Systems, NIPS.
5. Samira Briongos, Pedro Malagón, José L. Risco, and José M. Moya. 2017. Building Accurate Models to Determine the Current CPU Utilization of a Host within a Virtual Machine Allocated on It. In Proceedings of the Summer Simulation Multi-Conference (Bellevue, Washington) (SummerSim '17). Society for Computer Simulation International, San Diego, CA, USA, Article 33, 12 pages.