DELTA: Memory-Efficient Training via Dynamic Fine-Grained Recomputation and Swapping

Published: 2024-08-20
ISSN: 1544-3566
Container-title: ACM Transactions on Architecture and Code Optimization
Language: en
Short-container-title: ACM Trans. Archit. Code Optim.

Author:

Tang Yu¹, Li Qiao², Yin Lujia¹, Li Dongsheng¹, Zhang Yiming², Wang Chenyu³, Zhang Xingcheng⁴, Qiao Linbo¹, Zhang Zhaoning¹, Lu Kai¹

Affiliation:

1. National University of Defense Technology, Changsha, China

2. Xiamen University, Xiamen, China

3. SenseTime, Shanghai, China

4. Shanghai Artificial Intelligence Laboratory, Shanghai, China

Abstract

To fit increasingly large models into limited GPU memory, various coarse-grained techniques, such as recomputation and swapping, have been proposed to optimize memory usage. However, these methods are limited by either inefficient memory reduction or degraded training performance. In response, this paper introduces DELTA, an approach for memory-efficient large-scale model training that combines fine-grained memory optimization with prefetching to reduce memory usage while maintaining high training throughput. First, we formulate the joint memory-throughput optimization problem as a 0/1 knapsack problem that is easy to solve. Building on this formulation, we use an improved polynomial-complexity heuristic algorithm to solve the problem effectively. Furthermore, we introduce a novel bidirectional prefetching technique into dynamic memory management, which significantly accelerates model training compared with relying solely on recomputation or swapping. Finally, DELTA offers users an automated training execution library, eliminating the need for manual configuration or specialized expertise. Experimental results demonstrate the effectiveness of DELTA in reducing GPU memory consumption. Compared with state-of-the-art methods, DELTA achieves memory savings of 40% to 72% while maintaining comparable convergence for various models, including ResNet-50, ResNet-101, and BERT-Large. Notably, DELTA enables training GPT2-Large and GPT2-XL with batch sizes increased by 5.5× and 6×, respectively, showcasing its versatility and practicality in enabling large-scale model training on GPU hardware.
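As an illustration of the 0/1 knapsack framing mentioned in the abstract, the sketch below picks which tensors to keep resident in GPU memory under a fixed budget. It is not DELTA's algorithm: the abstract describes a polynomial-complexity heuristic that is not reproduced here, and this example substitutes the textbook dynamic-programming solution. The function name `choose_resident`, the per-tensor sizes, and the restore costs (the overhead of recomputing or swapping a tensor back in) are all hypothetical.

```python
from typing import List, Tuple


def choose_resident(
    sizes: List[int], costs: List[float], budget: int
) -> Tuple[float, List[int]]:
    """Textbook 0/1 knapsack via dynamic programming (illustrative only).

    sizes[i]  : memory footprint of tensor i (hypothetical units, e.g. MB)
    costs[i]  : overhead avoided by keeping tensor i resident (hypothetical, e.g. ms)
    budget    : memory available for resident tensors (same units as sizes)

    Returns the total overhead avoided and the indices of tensors to keep.
    Tensors that are not kept would have to be recomputed or swapped.
    """
    n = len(sizes)
    # best[i][b]: max overhead avoided using the first i tensors within b units
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        size, cost = sizes[i - 1], costs[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: evict tensor i-1
            if size <= b:  # option 2: keep it, if it fits
                best[i][b] = max(best[i][b], best[i - 1][b - size] + cost)

    # Walk back through the table to recover which tensors were kept.
    kept: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            kept.append(i - 1)
            b -= sizes[i - 1]
    kept.reverse()
    return best[n][budget], kept


# Made-up example: four tensors, 7 units of memory available.
sizes = [4, 3, 2, 5]
costs = [6.0, 5.0, 3.0, 7.0]
avoided, kept = choose_resident(sizes, costs, budget=7)
evicted = [i for i in range(len(sizes)) if i not in kept]
print(avoided, kept, evicted)  # 11.0 [0, 1] [2, 3]
```

The exact table costs O(n × budget) time and memory, which is one reason a practical system would prefer a heuristic when the budget is measured in fine-grained units.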

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3689338
