A Compiler Approach for Exploiting Partial SIMD Parallelism

Published:2016-04-05 Issue:1 Volume:13 Page:1-26
ISSN:1544-3566
Container-title:ACM Transactions on Architecture and Code Optimization
language:en
Short-container-title:ACM Trans. Archit. Code Optim.

Author:

Zhou Hao¹, Xue Jingling²

Affiliation:

1. UNSW Australia/NUDT, China

2. UNSW Australia, NSW, Australia

Abstract

Existing vectorization techniques are ineffective for loops that exhibit little loop-level parallelism but some limited superword-level parallelism (SLP). We show that effectively vectorizing such loops requires partial vector operations to be executed correctly and efficiently, where the degree of partial SIMD parallelism is smaller than the SIMD datapath width. We present a simple yet effective SLP compiler technique called Paver (PArtial VEctorizeR), formulated and implemented in LLVM as a generalization of the traditional SLP algorithm, to optimize such partially vectorizable loops. The key idea is to maximize SIMD utilization by widening vector instructions used while minimizing the overheads caused by memory access, packing/unpacking, and/or masking operations, without introducing new memory errors or new numeric exceptions. For a set of 9 C/C++/Fortran applications with partial SIMD parallelism, Paver achieves significantly better kernel and whole-program speedups than LLVM on both Intel’s AVX and ARM’s NEON.

Funder

Australian Research Council

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/2886101

Reference: 43 articles.

1. Efficient Selection of Vector Instructions Using Dynamic Programming

2. Nonlinear array layouts for hierarchical memory systems

Cited by 28 articles.

1. Optimizing Stencil Computation on Multi-core DSPs;Proceedings of the 53rd International Conference on Parallel Processing;2024-08-12

2. Boost Linear Algebra Computation Performance via Efficient VNNI Utilization;Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3;2024-04-27

3. PresCount: Effective Register Allocation for Bank Conflict Reduction;2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO);2024-03-02

4. Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores;Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3;2023-03-25

5. High Performance and Power Efficient Accelerator for Cloud Inference;2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA);2023-02