Extension VM: Interleaved Data Layout in Vector Memory

Authors:

Zhang Dunbo¹, Lang Qingjie¹, Wang Ruoxi¹, Shen Li²

Affiliations:

1. National University of Defense Technology, China

2. College of Computer, Key Laboratory of Advanced Microprocessor Chips and Systems, National University of Defense Technology, China

Abstract

Vector architectures are widely employed in processors for neural networks, signal processing, and high-performance computing; however, their performance is limited by inefficient column-major memory access. This limitation originates from the unsuitable mapping of multidimensional data structures onto two-dimensional vector memory spaces. In addition, the traditional data layout mapping method creates an irreconcilable conflict between row- and column-major accesses. Ideally, both row- and column-major accesses should be able to exploit the bank parallelism of vector memory. To this end, we propose the Interleaved Data Layout (IDL) method for vector memory, which distributes vector elements across different banks regardless of whether they are accessed in row- or column-major order, so that any vector memory access can benefit from bank parallelism. Additionally, we propose an Extension Vector Memory (EVM) architecture to implement IDL in vector memory. EVM can support two data layout methods and vector memory access modes simultaneously. The key idea is to continuously distribute the data to be accessed across different banks while it is being loaded from main memory. Thus, through careful programming and support from the ISA extension, EVM can exploit a higher degree of spatial locality. The experimental results showed that the proposed architecture delivers a 1.43-fold improvement over state-of-the-art vector processors at an area cost of only 1.73%. Furthermore, energy consumption was reduced by 50.1%.
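The row/column bank conflict described above can be illustrated with a small model. The Python sketch below assumes 8 banks that each serve one element per cycle, and compares a conventional mapping (bank = linear row-major address mod bank count) with a classic skewed, or diagonal, mapping (bank = (row + column) mod bank count). The skewed scheme is a well-known textbook technique chosen only for illustration; it is not necessarily the authors' IDL mapping, and all function names and the bank count are hypothetical.

```python
from collections import Counter

NUM_BANKS = 8  # assumed bank count for illustration; not taken from the paper


def linear_bank(row: int, col: int, row_len: int, num_banks: int = NUM_BANKS) -> int:
    """Conventional row-major mapping: linear address modulo the bank count."""
    return (row * row_len + col) % num_banks


def skewed_bank(row: int, col: int, num_banks: int = NUM_BANKS) -> int:
    """Skewed (diagonal) mapping: each row is rotated by its row index."""
    return (row + col) % num_banks


def access_cycles(banks: list[int]) -> int:
    """Cycles for one vector access = load on the busiest bank,
    assuming each bank serves one element per cycle."""
    return max(Counter(banks).values())


def compare(row_len: int = NUM_BANKS, num_banks: int = NUM_BANKS) -> dict[str, tuple[int, int]]:
    """Return (row-access cycles, column-access cycles) for each mapping when
    reading num_banks elements from row 0 and from column 0
    (assumes the matrix has at least num_banks rows and columns)."""
    row0 = [(0, c) for c in range(num_banks)]
    col0 = [(r, 0) for r in range(num_banks)]
    mappings = {
        "linear": lambda r, c: linear_bank(r, c, row_len, num_banks),
        "skewed": lambda r, c: skewed_bank(r, c, num_banks),
    }
    return {
        name: (access_cycles([bank(r, c) for r, c in row0]),
               access_cycles([bank(r, c) for r, c in col0]))
        for name, bank in mappings.items()
    }


if __name__ == "__main__":
    print(compare())  # {'linear': (1, 8), 'skewed': (1, 1)}
```

With a row length of 8, a column read under the conventional mapping sends all eight elements to bank 0 and serializes over 8 cycles, while the skewed mapping serves both the row and the column in a single cycle; with a row length of 12, the conventional column read still alternates between banks 0 and 4 and needs 4 cycles. The model tracks only bank indices; a real design would also need the within-bank offset and the matching address-generation logic.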

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software
