Extension VM: Interleaved Data Layout in Vector Memory

Published: 2023-11-07
ISSN: 1544-3566
Container-title: ACM Transactions on Architecture and Code Optimization
Language: en
Short-container-title: ACM Trans. Archit. Code Optim.

Author:

Zhang Dunbo¹, Lang Qingjie¹, Wang Ruoxi¹, Shen Li²

Affiliation:

1. National University of Defense Technology, China

2. College of Computer, Key Laboratory of Advanced Microprocessor Chips and Systems, National University of Defense Technology, China

Abstract

Vector architecture is widely employed in processors for neural networks, signal processing, and high-performance computing; however, its performance is limited by inefficient column-major memory access. This limitation originates from the unsuitable mapping of multidimensional data structures to two-dimensional vector memory spaces. In addition, the traditional data layout mapping method creates an irreconcilable conflict between row- and column-major accesses. Ideally, both row- and column-major accesses should take advantage of the bank parallelism of vector memory. To this end, we propose the Interleaved Data Layout (IDL) method in vector memory, which distributes vector elements into different banks regardless of whether they are in the row- or column-major category, so that any vector memory access can benefit from bank parallelism. Additionally, we propose an Extension Vector Memory (EVM) architecture to achieve IDL in vector memory. EVM supports two data layout methods and vector memory access modes simultaneously. The key idea is to continuously distribute the data that needs to be accessed from the main memory to different banks during the loading period. Thus, EVM can provide a larger spatial locality level through careful programming and the extension ISA support. The experimental results showed a 1.43-fold improvement over state-of-the-art vector processors by the proposed architecture, with an area cost of only 1.73%. Furthermore, the energy consumption was reduced by 50.1%.
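
The conflict between row- and column-major accesses described in the abstract can be illustrated with a small model of a banked memory. The sketch below is not the paper's IDL mapping, which the abstract does not specify; it assumes a hypothetical 4-bank memory and uses a classic skewed mapping (each row rotated by its row index) purely to show how a layout can spread both row and column accesses across all banks. The names `NUM_BANKS`, `bank_conventional`, `bank_skewed`, and `banks_touched` are illustrative.

```python
# Illustrative only: a classic skewed bank mapping, NOT the paper's IDL scheme.
NUM_BANKS = 4  # assumed number of vector-memory banks (hypothetical)


def bank_conventional(row, col, num_banks=NUM_BANKS):
    """Conventional layout: element (row, col) is stored in bank col % num_banks."""
    return col % num_banks


def bank_skewed(row, col, num_banks=NUM_BANKS):
    """Skewed layout: each row is rotated by its row index before bank selection."""
    return (row + col) % num_banks


def banks_touched(bank_fn, coords):
    """Return the set of distinct banks hit by one vector access."""
    return {bank_fn(r, c) for r, c in coords}


# One vector access along row 1 and one along column 1 of a 4x4 tile.
row_access = [(1, c) for c in range(NUM_BANKS)]
col_access = [(r, 1) for r in range(NUM_BANKS)]

conventional_row = banks_touched(bank_conventional, row_access)
conventional_col = banks_touched(bank_conventional, col_access)
skewed_row = banks_touched(bank_skewed, row_access)
skewed_col = banks_touched(bank_skewed, col_access)

print("conventional: row banks", len(conventional_row), "col banks", len(conventional_col))
print("skewed:       row banks", len(skewed_row), "col banks", len(skewed_col))
```

In this toy model the conventional layout serves a row access from all four banks but sends every element of a column access to the same bank, so the column access is serialized. The skewed layout touches four distinct banks in both cases, which is the property the abstract attributes to IDL.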

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3631528
