Study on progress threads placement and dedicated cores for overlapping MPI nonblocking collectives on manycore processor

Published:2019-07-02 Issue:6 Volume:33 Page:1240-1254
ISSN:1094-3420
Container-title:The International Journal of High Performance Computing Applications
language:en

Author:

Denis Alexandre¹, Jaeger Julien², Jeannot Emmanuel¹, Pérache Marc², Taboada Hugo²

Affiliation:

1. Inria, LaBRI, CNRS, University of Bordeaux, Bordeaux-INP, France

2. CEA, DAM, DIF, Arpajon, France

Abstract

To amortize the cost of MPI collective operations, nonblocking collectives have been proposed so as to allow communications to be overlapped with computation. Unfortunately, collective communications are more CPU-hungry than point-to-point communications, and running them in a communication thread on a dedicated CPU core makes them slow. On the other hand, running collective communications on the application cores leads to no overlap. In this article, we propose placement algorithms for progress threads that do not degrade performance when running on cores dedicated to communications to get communication/computation overlap. We first show that even simple collective operations, such as those based on a chain topology, are not straightforward to make progress in background on a dedicated core. Then, we propose an algorithm for tree-based collective operations that splits the tree between communication cores and application cores. To get the best of both worlds, the algorithm runs the short but heavy part of the tree on application cores, and the long but narrow part of the tree on one or several communication cores, so as to get a trade-off between overlap and absolute performance. We provide a model to study and predict its behavior and to tune its parameters. We implemented both algorithms in the Multi-Processor Computing (MPC) framework, which is a thread-based MPI implementation. We ran benchmarks on manycore processors such as the KNL and Skylake and obtained good results for both performance and overlap.
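
The overlap idea in the abstract can be illustrated with a toy sketch. This is not the paper's placement algorithm and does not use MPI: `fake_collective`, `run_overlapped`, and `run_sequential` are hypothetical names, and the "collective" is simulated with `time.sleep`, so it ignores the CPU cost that makes real collectives hard to progress on a dedicated core. It only shows why a background progress thread can hide communication time behind computation.

```python
import threading
import time


def fake_collective(duration, done):
    """Hypothetical stand-in for progressing a nonblocking collective.

    It sleeps instead of communicating, so it consumes no CPU time.
    """
    time.sleep(duration)
    done.set()


def run_overlapped(comm_time, compute_time):
    """Progress the fake collective in a background thread while computing."""
    done = threading.Event()
    start = time.perf_counter()
    progress = threading.Thread(target=fake_collective, args=(comm_time, done))
    progress.start()          # loosely analogous to posting a nonblocking collective
    time.sleep(compute_time)  # simulated application computation
    done.wait()               # loosely analogous to waiting for completion
    progress.join()
    return time.perf_counter() - start


def run_sequential(comm_time, compute_time):
    """Run the fake collective and the computation one after the other."""
    start = time.perf_counter()
    time.sleep(comm_time)
    time.sleep(compute_time)
    return time.perf_counter() - start


if __name__ == "__main__":
    print("overlapped:", run_overlapped(0.1, 0.1))
    print("sequential:", run_sequential(0.1, 0.1))
```

With equal simulated communication and computation times, the overlapped run should take roughly the longer of the two, while the sequential run takes their sum. A real progress thread competes for CPU with the application, which is the trade-off the paper's algorithms address.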

Publisher

SAGE Publications

Subject

Hardware and Architecture, Theoretical Computer Science, Software

Link

http://journals.sagepub.com/doi/pdf/10.1177/1094342019860184

References: 13 articles.

1. Optimization of MPI collective communication on BlueGene/L systems

2. Pajé: An Extensible Environment for Visualizing Multi-threaded Programs Executions

3. The BXI Interconnect Architecture

4. Leveraging non-blocking collective communication in high-performance applications

5. Message progression in parallel computing - to thread or not to thread?

Cited by 3 articles.

1. A methodology for assessing computation/communication overlap of MPI nonblocking collectives;Concurrency and Computation: Practice and Experience;2022-08-05

2. IMB-ASYNC: a revised method and benchmark to estimate MPI-3 asynchronous progress efficiency;Cluster Computing;2022-01-15

3. Implementation and performance evaluation of MPI persistent collectives in MPC: a case study;27th European MPI Users' Group Meeting;2020-09-21