Optimization of Collective Communication Operations in MPICH

Published: 2005-02 Issue: 1 Volume: 19 Page: 49-66
ISSN: 1094-3420
Container-title: The International Journal of High Performance Computing Applications
Language: en

Author:

Thakur Rajeev¹, Rabenseifner Rolf², Gropp William¹

Affiliation:

1. Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA

2. Rechenzentrum Universität Stuttgart (RUS), High Performance Computing Center (HLRS), University of Stuttgart, D-70550 Stuttgart, Germany

Abstract

We describe our work on improving the performance of collective communication operations in MPICH for clusters connected by switched networks. For each collective operation, we use multiple algorithms depending on the message size, with the goal of minimizing latency for short messages and minimizing bandwidth use for long messages. Although we have implemented new algorithms for all MPI (Message Passing Interface) collective operations, because of limited space we describe only the algorithms for allgather, broadcast, all-to-all, reduce-scatter, reduce, and allreduce. Performance results on a Myrinet-connected Linux cluster and an IBM SP indicate that, in all cases, the new algorithms significantly outperform the old algorithms used in MPICH on the Myrinet cluster, and, in many cases, they outperform the algorithms used in IBM's MPI on the SP. We also explore in further detail the optimization of two of the most commonly used collective operations, allreduce and reduce, particularly for long messages and nonpower-of-two numbers of processes. The optimized algorithms for these operations perform several times better than the native algorithms on a Myrinet cluster, IBM SP, and Cray T3E. Our results indicate that to achieve the best performance for a collective communication operation, one needs to use a number of different algorithms and select the right algorithm for a particular message size and number of processes.
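The idea of choosing an algorithm by message size can be illustrated with a small cost-model sketch. It is not taken from this record and is not MPICH's actual selection logic: it assumes the commonly used latency/bandwidth/compute model (alpha = per-message latency, beta = transfer time per byte, gamma = reduction time per byte), textbook cost estimates for two allreduce algorithms on a power-of-two number of processes, and hypothetical machine constants. The function names are illustrative only.

```python
import math

def cost_recursive_doubling(p, n, alpha, beta, gamma):
    # Assumed model: log2(p) steps, each exchanging and reducing all n bytes.
    steps = math.log2(p)
    return steps * (alpha + n * beta + n * gamma)

def cost_reduce_scatter_allgather(p, n, alpha, beta, gamma):
    # Assumed model: a reduce-scatter followed by an allgather.
    # Twice the latency steps, but each process moves only about 2n bytes in total.
    steps = math.log2(p)
    frac = (p - 1) / p
    return 2 * steps * alpha + 2 * frac * n * beta + frac * n * gamma

def pick_allreduce(p, n, alpha=1e-5, beta=1e-9, gamma=2.5e-10):
    # alpha, beta, gamma defaults are hypothetical, not measured values.
    if p < 2 or p & (p - 1) != 0:
        raise ValueError("this sketch only models power-of-two process counts")
    costs = {
        "recursive_doubling": cost_recursive_doubling(p, n, alpha, beta, gamma),
        "reduce_scatter_allgather": cost_reduce_scatter_allgather(p, n, alpha, beta, gamma),
    }
    # Return the name of the algorithm with the lowest modeled time.
    return min(costs, key=costs.get)

if __name__ == "__main__":
    for nbytes in (8, 1_000_000):
        print(nbytes, pick_allreduce(64, nbytes))
```

Under these assumed constants the latency-oriented algorithm is picked for the 8-byte message and the bandwidth-oriented one for the 1 MB message, which mirrors the short-message versus long-message trade-off the abstract describes. Non-power-of-two process counts, which the abstract treats separately, are deliberately left out of the sketch.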

Publisher

SAGE Publications

Subject

Hardware and Architecture, Theoretical Computer Science, Software

Link

http://journals.sagepub.com/doi/pdf/10.1177/1094342005051521

References: 26 articles.

1. LogGP: Incorporating Long Messages into the LogP Model for Parallel Computation

2. A Comparison of MPICH Allgather Algorithms on Switched Networks

Cited by 537 articles.

1. Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation;Proceedings of the 15th ACM SIGOPS Asia-Pacific Workshop on Systems;2024-09-04

2. OHIO: Improving RDMA Network Scalability in MPI_Alltoall Through Optimized Hierarchical and Intra/Inter-Node Communication Overlap Design;2024 IEEE Symposium on High-Performance Interconnects (HOTI);2024-08-21

3. Towards a Standardized Representation for Deep Learning Collective Algorithms;2024 IEEE Symposium on High-Performance Interconnects (HOTI);2024-08-21

4. Bandwidth-Aware and Overlap-Weighted Compression for Communication-Efficient Federated Learning;Proceedings of the 53rd International Conference on Parallel Processing;2024-08-12

5. Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning;Proceedings of the 53rd International Conference on Parallel Processing;2024-08-12