Accelerating distributed deep neural network training with pipelined MPI allreduce
Published: 2021-08-07
Issue: 4
Volume: 24
Pages: 3797-3813
ISSN: 1386-7857
Container-title: Cluster Computing
Short-container-title: Cluster Comput
Language: en
Authors: Castelló, Adrián; Quintana-Ortí, Enrique S.; Duato, José
Abstract
TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool to train deep neural networks on clusters of computers. HVD in turn utilizes a blocking Allreduce primitive to share information among processes, combined with a communication thread to overlap communication with computation. In this work, we perform a thorough experimental analysis to expose (1) the importance of selecting the best algorithm in MPI libraries to realize the Allreduce operation; and (2) the performance acceleration that can be attained when replacing a blocking Allreduce with its non-blocking counterpart (while maintaining the blocking behaviour via the appropriate synchronization mechanism). Furthermore, (3) we explore the benefits of applying pipelining to the communication exchange, demonstrating that these improvements carry over to distributed training via TF+HVD. Finally, (4) we show that pipelining can also boost performance for applications that make heavy use of other collectives, such as Broadcast and Reduce-Scatter.
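To illustrate points (2) and (3) of the abstract, the sketch below shows, in generic MPI C, how a single blocking Allreduce can be replaced by a pipelined sequence of non-blocking MPI_Iallreduce calls that is completed with MPI_Waitall, so the routine still behaves as a blocking collective for the caller. This is not the authors' implementation; the segment size SEG_ELEMS and the helper name pipelined_allreduce are hypothetical choices for this example.

/*
 * Minimal sketch (assumed, not from the paper): a pipelined,
 * non-blocking replacement for a blocking MPI_Allreduce.
 * The buffer is split into segments, one MPI_Iallreduce is posted
 * per segment, and all requests are completed with MPI_Waitall,
 * preserving blocking semantics overall.
 */
#include <mpi.h>
#include <stdlib.h>

#define SEG_ELEMS 262144   /* elements per pipeline segment (assumed tuning value) */

static void pipelined_allreduce(const float *sendbuf, float *recvbuf,
                                int count, MPI_Comm comm)
{
    int nseg = (count + SEG_ELEMS - 1) / SEG_ELEMS;
    MPI_Request *reqs = malloc(nseg * sizeof(MPI_Request));

    for (int s = 0; s < nseg; s++) {
        int offset = s * SEG_ELEMS;
        int len = (offset + SEG_ELEMS <= count) ? SEG_ELEMS : count - offset;
        /* Post one non-blocking Allreduce per segment. */
        MPI_Iallreduce(sendbuf + offset, recvbuf + offset, len,
                       MPI_FLOAT, MPI_SUM, comm, &reqs[s]);
    }
    /* Wait for all segments so the call keeps its blocking behaviour. */
    MPI_Waitall(nseg, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}

Splitting the exchange into segments lets the MPI library overlap the reduction of one segment with the communication of the next; the best segment size depends on the network and the MPI implementation.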
Funder
Ministerio de Ciencia, Innovación y Universidades; Agencia Valenciana de la Innovación; PRACE preparatory access
Publisher
Springer Science and Business Media LLC
Subject
Computer Networks and Communications, Software
Cited by: 9 articles.