Accelerating distributed deep neural network training with pipelined MPI allreduce
Published: 2021-08-07
Issue: 4
Volume: 24
Pages: 3797-3813
ISSN: 1386-7857
Container-title: Cluster Computing
Short-container-title: Cluster Comput
Language: en
Authors: Castelló, Adrián; Quintana-Ortí, Enrique S.; Duato, José
Abstract
TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool to train deep neural networks on clusters of computers. HVD in turn utilizes a blocking Allreduce primitive to share information among processes, combined with a communication thread to overlap communication with computation. In this work, we perform a thorough experimental analysis to expose (1) the importance of selecting the best algorithm in MPI libraries to realize the Allreduce operation; and (2) the performance acceleration that can be attained when replacing a blocking Allreduce with its non-blocking counterpart (while maintaining the blocking behaviour via the appropriate synchronization mechanism). Furthermore, (3) we explore the benefits of applying pipelining to the communication exchange, demonstrating that these improvements carry over to distributed training via TF+HVD. Finally, (4) we show that pipelining can also boost performance for applications that make heavy use of other collectives, such as Broadcast and Reduce-Scatter.
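To illustrate points (2) and (3) of the abstract, the sketch below shows, in generic MPI C, how a single blocking Allreduce can be replaced by a pipelined sequence of non-blocking MPI_Iallreduce calls that is completed with MPI_Waitall, so the routine still behaves as a blocking collective for the caller. This is not the authors' implementation; the segment size SEG_ELEMS and the helper name pipelined_allreduce are hypothetical choices for this example.

/*
 * Minimal sketch (assumed, not from the paper): a pipelined,
 * non-blocking replacement for a blocking MPI_Allreduce.
 * The buffer is split into segments, one MPI_Iallreduce is posted
 * per segment, and all requests are completed with MPI_Waitall,
 * preserving blocking semantics overall.
 */
#include <mpi.h>
#include <stdlib.h>

#define SEG_ELEMS 262144   /* elements per pipeline segment (assumed tuning value) */

static void pipelined_allreduce(const float *sendbuf, float *recvbuf,
                                int count, MPI_Comm comm)
{
    int nseg = (count + SEG_ELEMS - 1) / SEG_ELEMS;
    MPI_Request *reqs = malloc(nseg * sizeof(MPI_Request));

    for (int s = 0; s < nseg; s++) {
        int offset = s * SEG_ELEMS;
        int len = (offset + SEG_ELEMS <= count) ? SEG_ELEMS : count - offset;
        /* Post one non-blocking Allreduce per segment. */
        MPI_Iallreduce(sendbuf + offset, recvbuf + offset, len,
                       MPI_FLOAT, MPI_SUM, comm, &reqs[s]);
    }
    /* Wait for all segments so the call keeps its blocking behaviour. */
    MPI_Waitall(nseg, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}

Splitting the exchange into segments lets the MPI library overlap the reduction of one segment with the communication of the next; the best segment size depends on the network and the MPI implementation.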
Funder
Ministerio de Ciencia, Innovación y Universidades; Agencia Valenciana de la Innovación; PRACE preparatory access
Publisher
Springer Science and Business Media LLC
Subject
Computer Networks and Communications, Software
Cited by: 9 articles.