A decomposition for in-place matrix transposition

Published: 2014-11-26 Issue: 8 Volume: 49 Page: 193-206
ISSN: 0362-1340
Container-title: ACM SIGPLAN Notices
Language: en
Short-container-title: SIGPLAN Not.

Author:

Bryan Catanzaro¹, Alexander Keller², Michael Garland¹

Affiliation:

1. NVIDIA, Santa Clara, CA, USA

2. NVIDIA, Berlin, Germany

Abstract

We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors. Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize, and on matrices of dimension m by n require O(mn log mn) work when limited to less than O(mn) auxiliary space. Our decomposition allows the rows and columns to be operated on independently during in-place transposition, reducing work complexity to O(mn), given O(max(m, n)) auxiliary space. This decomposition leads to an efficient and naturally parallel algorithm: we have measured median throughput of 19.5 GB/s on an NVIDIA Tesla K20c processor. An implementation specialized for the skinny matrices that arise when converting Arrays of Structures to Structures of Arrays yields median throughput of 34.3 GB/s, and a maximum throughput of 51 GB/s. Because of the simple structure of this algorithm, it is particularly suited for implementation using SIMD instructions to transpose the small arrays that arise when SIMD processors load from or store to Arrays of Structures. Using this algorithm to cooperatively perform accesses to Arrays of Structures, we measure 180 GB/s throughput on the K20c, which is up to 45 times faster than compiler-generated Array of Structures accesses. In this paper, we explain the algorithm, prove its correctness and complexity, and explain how it can be instantiated efficiently for solving various transpose problems on both CPUs and GPUs.
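For orientation, the cycle-following baseline that the abstract contrasts against can be sketched as below. This is a minimal illustration of the traditional approach, not the decomposition proposed in the paper; the function name `transpose_in_place` and the flat row-major list layout are assumptions made for the example. It uses O(1) auxiliary space by re-walking each cycle to find its smallest index, which is the kind of extra work the abstract attributes to space-limited cycle following.

```python
def transpose_in_place(a, m, n):
    """Transpose a row-major m-by-n matrix stored in the flat list `a`, in place.

    Hypothetical helper illustrating classic cycle following; it is not the
    row/column decomposition described in the paper.
    """
    size = m * n
    if len(a) != size:
        raise ValueError("len(a) must equal m * n")
    if size < 2:
        return a
    last = size - 1

    def dest(i):
        # The element at row-major index i = r*n + c belongs at index c*m + r
        # in the transposed (n-by-m) layout; for i != last that equals
        # (i * m) mod (m*n - 1). The last element never moves.
        return i if i == last else (i * m) % last

    for start in range(1, last):
        # Walk the cycle once to check whether `start` is its smallest index.
        j = dest(start)
        while j > start:
            j = dest(j)
        if j < start:
            continue  # this cycle was already rotated from a smaller index

        # Rotate the values along the cycle, carrying one element at a time.
        carried = a[start]
        j = dest(start)
        while j != start:
            a[j], carried = carried, a[j]
            j = dest(j)
        a[start] = carried
    return a


if __name__ == "__main__":
    # Four (x, y, z) structures stored as an Array of Structures form a
    # 4-by-3 row-major matrix; transposing it groups all x values, then all
    # y values, then all z values (a Structure of Arrays layout).
    aos = ["x0", "y0", "z0", "x1", "y1", "z1",
           "x2", "y2", "z2", "x3", "y3", "z3"]
    print(transpose_in_place(aos, 4, 3))
```

Each cycle depends on a sequential chain of moves and cycle lengths vary with the matrix shape, which suggests why the abstract describes this style as difficult to parallelize.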

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design, Software

Link

https://dl.acm.org/doi/pdf/10.1145/2692916.2555253

References: 11 articles.

1. Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion

2. Intel. Intel MKL 2013. URL http://software.intel.com/en-us/intel-mkl.

3. Tight bounds on the complexity of parallel sorting

4. Scalable Parallel Programming with CUDA

Cited by 11 articles.

1. Enabling zero knowledge proof by accelerating zk-SNARK kernels on GPU;Journal of Parallel and Distributed Computing;2023-03

2. Optimized Computation for Determinant of Multivariate Polynomial Matrices on GPGPU;2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys);2022-12

3. FIST-HOSVD;Proceedings of the Platform for Advanced Scientific Computing Conference;2022-06-27

4. AMT: asynchronous in-place matrix transpose mechanism for sunway many-core processor;The Journal of Supercomputing;2022-01-17

5. Alpinist: An Annotation-Aware GPU Program Optimizer;Tools and Algorithms for the Construction and Analysis of Systems;2022