1. PyTorch. 2023. torch.amp Gradient Scaling. https://pytorch.org/docs/2.0/amp.html#gradient-scaling.
2. Youhui Bai, Cheng Li, Quan Zhou, Jun Yi, Ping Gong, Feng Yan, Ruichuan Chen, and Yinlong Xu. 2021. Gradient Compression Supercharged High-Performance Data Parallel DNN Training. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP '21).
3. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877--1901.
4. Jiarui Fang, Zilin Zhu, Shenggui Li, Hui Su, Yang Yu, Jie Zhou, and Yang You. 2023. Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management. IEEE Transactions on Parallel and Distributed Systems 34, 1 (2023).
5. Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. PipeDream: Fast and efficient pipeline parallel DNN training. arXiv preprint arXiv:1806.03377 (2018).