Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training

Published: 2023-12-14 Issue: 4 Volume: 20 Page: 1-25
ISSN: 1544-3566
Container-title: ACM Transactions on Architecture and Code Optimization
Language: en
Short-container-title: ACM Trans. Archit. Code Optim.

Author:

Wei Jia¹, Zhang Xingjun¹, Wang Longxiang¹, Wei Zheng¹

Affiliation:

1. Xi’an Jiaotong University, China

Abstract

In recent years, driven by growth in model size and complexity, deep learning has achieved tremendous success in computer vision (CV) and natural language processing (NLP). Training deep learning models on accelerators such as GPUs often requires large amounts of data to be transferred iteratively from NVMe SSDs to GPU memory. Much recent work has focused on data transfer during the pre-processing phase and has introduced techniques such as multiprocessing and GPUDirect Storage (GDS) to accelerate it. However, tensor data produced during training (such as checkpoints, logs, and intermediate feature maps), whose transfer is also time-consuming, is often moved using traditional serial methods with long I/O paths. In this article, we present Fastensor, an efficient GDS-based tool for tensor data transfer between NVMe SSDs and GPUs. To achieve higher tensor I/O throughput, we optimized the traditional data I/O process and proposed a tensor I/O algorithm that is aware of both the data and the runtime context. During model training, Fastensor selects the most suitable data transfer tool for the current tensor from a candidate set of tools. The optimal tool is looked up in a dictionary generated by our adaptive exploration algorithm during the first few training iterations. We used Fastensor's unified interface to test the read/write bandwidth and energy consumption of different transfer tools for different tensor block sizes. We found that the execution efficiency of a tensor transfer tool depends on both the tensor block size and the runtime context. We then deployed Fastensor in the widely used PyTorch deep learning framework. We showed that, with the same hardware configuration, Fastensor delivers superior performance in the typical scenarios of model parameter saving and intermediate feature map transfer. Fastensor achieves a 5.37x read performance improvement compared to torch.save() when used for model parameter saving. When used for intermediate feature map transfer, Fastensor can increase the supported training batch size by 20x, while the total read and write speed is increased by 2.96x compared to the torch I/O API.
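The selection step described in the abstract (explore the candidate tools during the first few iterations, record the winner in a dictionary, then look it up) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ToolSelector`, the power-of-two size bucketing, the `explore_rounds` parameter, and the injectable `clock` are all assumptions made for illustration, and the candidate tools are plain callables standing in for real transfer back ends such as a GDS-based writer or `torch.save()`.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Hashable, Tuple


def size_bucket(nbytes: int) -> int:
    """Group tensor sizes by power of two so similar sizes share one entry."""
    return max(nbytes, 1).bit_length()


class ToolSelector:
    """Hypothetical selector (not Fastensor's actual code).

    For the first `explore_rounds` transfers of each (size bucket, context)
    key it times every candidate tool; afterwards it reuses the fastest one.
    """

    def __init__(
        self,
        tools: Dict[str, Callable[[bytes], None]],
        explore_rounds: int = 2,
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self.tools = tools
        self.explore_rounds = explore_rounds
        self.clock = clock
        self.seen = defaultdict(int)  # key -> exploration rounds done
        self.timings = defaultdict(lambda: defaultdict(float))  # key -> tool -> seconds
        self.best: Dict[Tuple[int, Hashable], str] = {}  # the lookup dictionary

    def transfer(self, data: bytes, context: Hashable = "default") -> str:
        """Transfer `data`; return the tool used, or "explore" while exploring."""
        key = (size_bucket(len(data)), context)
        if key in self.best:
            name = self.best[key]
            self.tools[name](data)
            return name

        # Exploration round: run every candidate and accumulate its elapsed time.
        for name, tool in self.tools.items():
            start = self.clock()
            tool(data)
            self.timings[key][name] += self.clock() - start

        self.seen[key] += 1
        if self.seen[key] >= self.explore_rounds:
            per_tool = self.timings[key]
            self.best[key] = min(per_tool, key=per_tool.get)
        return "explore"
```

In this sketch the runtime context is an opaque hashable value supplied by the caller (for example, a label for the current training phase), and each exploration round performs the transfer once per candidate, which a real system would likely avoid by spreading candidates across iterations.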

Funder

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3630108

References (40 articles)

1. Dynamic Memory Management for GPU-Based Training of Deep Neural Networks

2. Jonghyun Bae, Jongsung Lee, Yunho Jin, Sam Son, Shine Kim, Hakbeom Jang, Tae Jun Ham, and Jae W. Lee. 2021. FlashNeuron: SSD-enabled large-batch training of very deep neural networks. In Proceedings of the 19th USENIX Conference on File and Storage Technologies, Marcos K. Aguilera and Gala Yadgar (Eds.). USENIX Association, 387–401.

3. SPIN

4. CSWAP: A Self-Tuning Compression Framework for Accelerating Tensor Swapping in GPUs

5. moDNN: Memory Optimal Deep Neural Network Training on Graphics Processing Units