1. Athlur, S., Saran, N., Sivathanu, M., Ramjee, R., Kwatra, N.: Varuna: scalable, low-cost training of massive deep learning models. In: Proceedings of the Seventeenth European Conference on Computer Systems, pp. 472–487 (2022)
2. Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
4. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
5. Eliad, S., Hakimi, I., De Jagger, A., Silberstein, M., Schuster, A.: Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism. In: 2021 USENIX Annual Technical Conference (USENIX ATC 21), pp. 381–396 (2021)