1. PyTorch. 2023. torch.amp Gradient Scaling. https://pytorch.org/docs/2.0/amp.html#gradient-scaling.
2. Youhui Bai, Cheng Li, Quan Zhou, Jun Yi, Ping Gong, Feng Yan, Ruichuan Chen, and Yinlong Xu. 2021. Gradient Compression Supercharged High-Performance Data Parallel DNN Training. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP '21).
3. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877--1901.
4. Jiarui Fang, Zilin Zhu, Shenggui Li, Hui Su, Yang Yu, Jie Zhou, and Yang You. 2023. Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management. IEEE Transactions on Parallel and Distributed Systems 34, 1 (2023).
5. Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. PipeDream: Fast and efficient pipeline parallel DNN training. arXiv preprint arXiv:1806.03377 (2018).