1. BLOOM Training. https://huggingface.co/blog/bloom-megatron-deepspeed#training-difficulties.
2. Boost Checkpoint Speed and Reduce Cost with Nebula. https://learn.microsoft.com/en-us/azure/machine-learning/reference-checkpoint-performance-for-large-models.
3. CRIU: Checkpoint Restore in Userspace. https://criu.org/Main_Page.
4. DeepSpeed. https://www.deepspeed.ai/.
5. DeepSpeed: Extreme-scale model training for everyone. https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/.