1. Mandeep Baines Shruti Bhosale Vittorio Caggiano Naman Goyal Siddharth Goyal Myle Ott Benjamin Lefaudeux Vitaliy Liptchinsky Mike Rabbat Sam Sheiffer Anjali Sridhar and Min Xu. 2021. FairScale: A general purpose modular PyTorch library for high performance and large scale training. https://github.com/facebookresearch/fairscale. Mandeep Baines Shruti Bhosale Vittorio Caggiano Naman Goyal Siddharth Goyal Myle Ott Benjamin Lefaudeux Vitaliy Liptchinsky Mike Rabbat Sam Sheiffer Anjali Sridhar and Min Xu. 2021. FairScale: A general purpose modular PyTorch library for high performance and large scale training. https://github.com/facebookresearch/fairscale.
2. Zhengda Bian , Hongxin Liu , Boxiang Wang , Haichen Huang , Yongbin Li , Chuanrui Wang , Fan Cui , and Yang You . 2021. Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. arXiv preprint arXiv:2110.14883 ( 2021 ). Zhengda Bian, Hongxin Liu, Boxiang Wang, Haichen Huang, Yongbin Li, Chuanrui Wang, Fan Cui, and Yang You. 2021. Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training. arXiv preprint arXiv:2110.14883 (2021).
3. Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell etal 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901. Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901.
4. Tianqi Chen , Bing Xu , Chiyuan Zhang , and Carlos Guestrin . 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 ( 2016 ). Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016).
5. Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).