Affiliation:
1. Department of Computer Science and Automation, Indian Institute of Science
Abstract
Current de facto parallel programming models like OpenMP and MPI make it difficult to extract task-level dataflow parallelism as opposed to bulk-synchronous parallelism. Task-parallel approaches that use point-to-point synchronization between dependent tasks, in conjunction with dynamically scheduled dataflow runtimes, are thus becoming attractive. Although these approaches can extract good performance on both shared and distributed memory, there is little compiler support for them.
In this article, we describe the design of compiler-runtime interaction to automatically extract coarse-grained dataflow parallelism in affine loop nests for both shared- and distributed-memory architectures. We use techniques from the polyhedral compiler framework to extract tasks and to generate components of the runtime that dynamically schedule the generated tasks. The runtime includes a distributed, decentralized scheduler that dynamically schedules tasks on each node. The schedulers on different nodes cooperate with each other through asynchronous point-to-point communication, and all of this is achieved by code automatically generated by the compiler. On a set of six representative affine loop nest benchmarks, running on 32 nodes with 8 threads each, our compiler-assisted runtime yields a geometric mean speedup of 143.6× (70.3× to 474.7×) over the sequential version and a geometric mean speedup of 1.64× (1.04× to 2.42×) over a state-of-the-art automatic parallelization approach that uses bulk synchronization. We also compare our system with past work that addresses some of these challenges on shared memory, and with an emerging runtime (Intel Concurrent Collections) that demands greater programmer input and effort for parallelization. To the best of our knowledge, ours is also the first automatic scheme that allows dynamic scheduling of affine loop nests on a cluster of multicores.
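The core runtime mechanism the abstract describes, tasks synchronized point-to-point through their dependences rather than through global barriers, can be illustrated with a minimal shared-memory sketch. This is not the paper's implementation; it is an illustrative dependence-counter scheduler in Python, where each task carries a counter of unfinished predecessors and a worker enqueues a successor the moment its counter drops to zero. All names (`Task`, `run_dataflow`) are hypothetical.

```python
import threading
from collections import defaultdict
from queue import Queue

class Task:
    """A unit of work with explicit dependences on other tasks."""
    def __init__(self, name, fn, deps=()):
        self.name = name
        self.fn = fn
        self.deps = list(deps)
        self.remaining = len(self.deps)  # dependence counter

def run_dataflow(tasks, num_workers=4):
    """Execute a task DAG with point-to-point synchronization:
    a task becomes ready when its last predecessor finishes,
    with no global barrier between 'phases' of the DAG."""
    successors = defaultdict(list)
    for t in tasks:
        for d in t.deps:
            successors[d].append(t)

    ready = Queue()
    for t in tasks:
        if t.remaining == 0:
            ready.put(t)          # initially ready tasks

    lock = threading.Lock()
    finished_order = []
    unfinished = [len(tasks)]

    def worker():
        while True:
            t = ready.get()
            if t is None:         # shutdown sentinel
                return
            t.fn()
            with lock:
                finished_order.append(t.name)
                unfinished[0] -= 1
                # Notify only the tasks that depend on t.
                for s in successors[t]:
                    s.remaining -= 1
                    if s.remaining == 0:
                        ready.put(s)
                if unfinished[0] == 0:
                    for _ in range(num_workers):
                        ready.put(None)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return finished_order

# Usage: a diamond DAG A -> {B, C} -> D.
a = Task("A", lambda: None)
b = Task("B", lambda: None, deps=[a])
c = Task("C", lambda: None, deps=[a])
d = Task("D", lambda: None, deps=[b, c])
order = run_dataflow([a, b, c, d], num_workers=2)
```

In a distributed setting, as the abstract indicates, the decrement of a remote successor's counter would instead be carried by an asynchronous point-to-point message between per-node schedulers, but the dependence-counting principle is the same.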
Publisher
Association for Computing Machinery (ACM)
Subject
Computational Theory and Mathematics, Computer Science Applications, Hardware and Architecture, Modeling and Simulation, Software
Cited by
7 articles.
1. Automatic Generation of Distributed-Memory Mappings for Tensor Computations;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2023-11-11
2. A Pipeline Pattern Detection Technique in Polly;Workshop Proceedings of the 51st International Conference on Parallel Processing;2022-08-29
3. TLP: Towards three‐level loop parallelisation;IET Computers & Digital Techniques;2022-08-09
4. Tile size selection of affine programs for GPGPUs using polyhedral cross-compilation;Proceedings of the ACM International Conference on Supercomputing;2021-06-03
5. Abstractions for Polyhedral Topology-Aware Tasking [Position Paper];Languages and Compilers for Parallel Computing;2021