Extracting SIMD Parallelism from Recursive Task-Parallel Programs

Published:2019-12-26 Issue:4 Volume:6 Page:1-37
ISSN:2329-4949
Container-title:ACM Transactions on Parallel Computing
language:en
Short-container-title:ACM Trans. Parallel Comput.

Author:

Ren Bin¹, Balakrishna Shruthi², Jo Youngjoon², Krishnamoorthy Sriram³, Agrawal Kunal⁴, Kulkarni Milind²

Affiliation:

1. William & Mary, Pacific Northwest National Laboratory

2. Purdue University

3. Pacific Northwest National Laboratory

4. Washington University in St. Louis

Abstract

The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to execute data-parallel computations in a vectorized manner efficiently. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This article presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel’s SSE4.2 vector units, as well as accelerators using Intel’s AVX512 units. We then show through rigorous sampling that, in practice, our vectorization techniques are effective for a much larger class of programs.
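To make the idea of "data parallelism latent in recursive, task-parallel programs" concrete, here is a minimal sketch. It is not the article's actual transformation: the Fibonacci example and the function names `fib_recursive` and `fib_blocked` are assumptions chosen for illustration. It only shows how independent recursive tasks can be gathered into a flat block so that each step becomes a uniform loop over many tasks, which is the loop shape vector hardware handles well.

```python
def fib_recursive(n):
    """Plain divide-and-conquer form: each call spawns two independent subtasks."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_blocked(n):
    """Level-by-level form (illustrative assumption, not the paper's code).

    All pending tasks of one recursion level sit in a single flat block, and
    every pass applies the same operation to each task in that block.
    """
    block = [n]   # current block of task arguments
    total = 0     # reduction over the values of all base cases
    while block:
        next_block = []
        for arg in block:                    # uniform loop over independent tasks
            if arg < 2:
                total += arg                 # base case contributes to the result
            else:
                next_block.append(arg - 1)   # the two "spawned" subtasks
                next_block.append(arg - 2)
        block = next_block
    return total
```

Both functions compute the same value, but the blocked form can grow its task block exponentially with recursion depth. That growth is one reason the abstract pairs the transformations with scheduling policies that limit space usage.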

Funder

NSF

Battelle for DOE

U.S. Department of Energy's (DOE) Office of Science, Office of Advanced Scientific Computing Research, under DOE Early Career

Publisher

Association for Computing Machinery (ACM)

Subject

Computational Theory and Mathematics, Computer Science Applications, Hardware and Architecture, Modeling and Simulation, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3365663

References (66 articles)

1. Timo Aila and Samuli Laine. 2009. Understanding the efficiency of ray traversal on GPUs. In HPG’09. 145--149.

2. Barcelona OpenMP Task Suite (BOTS). 2012. https://pm.bsc.es/projects/bots.

3. Lars Bergstrom, Matthew Fluet, Mike Rainey, John Reppy, Stephen Rosen, and Adam Shaw. 2013. Data-only flattening for nested data parallelism. ACM SIGPLAN Notices 48. ACM, 81--92.

Cited by 1 article.

1. Parallel approaches for a decision tree-based explainability algorithm. Future Generation Computer Systems. 2024-09.