COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loop

Published:2024-01-19 Issue:1 Volume:21 Page:1-26
ISSN:1544-3566
Container-title:ACM Transactions on Architecture and Code Optimization
language:en
Short-container-title:ACM Trans. Archit. Code Optim.

Author:

Mishra Prasoon¹, Nandivada V. Krishna¹

Affiliation:

1. Indian Institute of Technology Madras, India

Abstract

Parallel libraries such as OpenMP distribute the iterations of parallel-for-loops among the threads, using a programmer-specified scheduling policy. While the existing scheduling policies perform reasonably well for balanced workloads, obtaining an efficient distribution of work is highly non-trivial for computations with highly imbalanced workloads, even with non-static scheduling methods such as dynamic and guided. In this paper, we present a scheme called COst aware Work Stealing (COWS) to efficiently extend the idea of work-stealing to OpenMP. In contrast to traditional work-stealing schedulers, COWS takes into consideration that (i) not all iterations of a parallel-for-loop may take the same amount of time, (ii) identifying a suitable victim for stealing is important for load-balancing, and (iii) queues lead to significant overheads in traditional work-stealing and should be avoided. We present two variations of COWS: WSRI (a naive work-stealing scheme based on the number of remaining iterations) and WSRW (a work-stealing scheme based on the amount of remaining workload). Since in irregular loops, like those found in graph analytics, it is not possible to statically compute the cost of the iterations of the parallel-for-loops, we use a combined compile-time + runtime approach, where the remaining workload of a loop is computed efficiently at runtime by utilizing the code generated by our compile-time component. We have performed an evaluation over seven different benchmark programs, using five different input datasets, on two different hardware platforms, across a varying number of threads, leading to a total of 275 configurations. We show that in 225 out of 275 configurations, compared to the best OpenMP scheduling scheme for that configuration, our approach achieves clear performance gains.
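
The difference between the two victim-selection criteria named in the abstract can be illustrated with a small sequential sketch. This is not the authors' implementation: the abstract does not describe the data structures, so the per-thread `(lo, hi)` iteration ranges (used here in place of queues), the prefix-sum cost table, the steal-half policy, and all function names are assumptions made for illustration. The per-iteration costs are supplied as a plain list (for example, vertex degrees), whereas the paper computes the remaining workload at runtime using compiler-generated code.

```python
from bisect import bisect_left
from itertools import accumulate

def prefix_costs(costs):
    """prefix[i] is the total cost of iterations 0..i-1 (hypothetical helper)."""
    return [0] + list(accumulate(costs))

def remaining_iterations(rng):
    lo, hi = rng
    return hi - lo

def remaining_work(rng, prefix):
    # O(1) lookup of the cost still pending in the half-open range [lo, hi).
    lo, hi = rng
    return prefix[hi] - prefix[lo]

def pick_victim_wsri(ranges, thief):
    """WSRI-style choice: the thread with the most remaining iterations."""
    others = [t for t in range(len(ranges)) if t != thief]
    return max(others, key=lambda t: remaining_iterations(ranges[t]))

def pick_victim_wsrw(ranges, thief, prefix):
    """WSRW-style choice: the thread with the most remaining workload."""
    others = [t for t in range(len(ranges)) if t != thief]
    return max(others, key=lambda t: remaining_work(ranges[t], prefix))

def steal_half_work(ranges, thief, victim, prefix):
    """Give the thief the tail of the victim's range, split near half the work.

    Assumed policy: the victim keeps the head of its range, the thief takes the tail.
    """
    lo, hi = ranges[victim]
    target = prefix[lo] + remaining_work((lo, hi), prefix) / 2
    split = bisect_left(prefix, target, lo, hi)  # first index with prefix >= target
    ranges[victim] = (lo, split)
    ranges[thief] = (split, hi)

# Twelve iterations statically chunked over three threads; the last chunk is heavy.
costs = [1] * 8 + [50, 20, 10, 10]
prefix = prefix_costs(costs)

# Thread 0 has finished, thread 1 has not started, thread 2 has done iteration 8.
ranges = [(4, 4), (4, 8), (9, 12)]

print(pick_victim_wsri(ranges, 0))          # 1: four iterations left, but cheap ones
print(pick_victim_wsrw(ranges, 0, prefix))  # 2: three iterations left, 40 units of work

steal_half_work(ranges, 0, 2, prefix)
print(ranges)                               # [(10, 12), (4, 8), (9, 10)]
```

In this toy input the iteration-count criterion sends the idle thread to a victim holding only 4 units of work, while the workload criterion finds the thread holding 40 units and leaves both thief and victim with 20 units each. A real runtime would also need atomic updates of the range bounds, which this sequential sketch omits.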

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3633331
