Abstract
GPGPU architectures have become the dominant platform for massively parallel workloads, delivering high performance and energy efficiency for popular applications such as machine learning, computer vision, and self-driving cars. However, irregular applications, such as graph processing, fail to fully exploit GPGPU resources because their divergent memory accesses saturate the memory hierarchy. To reduce the pressure on the memory subsystem for divergent memory-intensive applications, programmers must take into account the SIMT execution model and memory coalescing in GPGPUs, devoting significant effort to complex optimization techniques. Despite these efforts, we show that irregular graph processing still suffers from low GPGPU performance. We observe that in many irregular applications the mapping of data to threads can be safely changed. In other words, it is possible to relax the strict relationship between a thread and the data it processes in order to reduce memory divergence. Based on this observation, we propose the Irregular accesses Reorder Unit (IRU), a novel hardware extension tightly integrated into the GPGPU pipeline. The IRU reorders the data processed by threads on irregular accesses to improve memory coalescing, i.e., it tries to assign data elements to threads so as to produce coalesced accesses within SIMT groups. Furthermore, the IRU is capable of filtering and merging duplicated accesses, significantly reducing the workload. Programmers can easily utilize the IRU through a simple API, or let the compiler issue instructions from our extended ISA. We evaluate our proposal on state-of-the-art graph-based algorithms and a wide selection of applications. Results show that the IRU achieves a memory coalescing improvement of 1.32x and a 46% reduction in overall traffic in the memory hierarchy, which results in a 1.33x speedup and 13% energy savings on average, while incurring a small 5.6% area overhead.
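To illustrate the idea behind the abstract, the following is a minimal software sketch (not the actual IRU hardware, whose microarchitecture is described in the paper) of why relaxing the data-to-thread mapping helps: reordering an irregular index stream so that nearby addresses land in the same SIMT group, and merging duplicate accesses. The warp and cache-line sizes, the index stream, and the helper names are illustrative assumptions.

```python
# Software sketch (NOT the IRU hardware) of reorder-and-merge on an
# irregular index stream. Sizes below are illustrative: a 32-thread warp
# and 128-byte cache lines holding 32 four-byte elements.

WARP_SIZE = 32          # threads per SIMT group
ELEMS_PER_LINE = 32     # 128-byte line / 4-byte element

def total_transactions(index_stream):
    """Count distinct cache lines touched by each warp-sized chunk,
    a rough proxy for the memory transactions a warp issues."""
    total = 0
    for w in range(0, len(index_stream), WARP_SIZE):
        chunk = index_stream[w:w + WARP_SIZE]
        total += len({i // ELEMS_PER_LINE for i in chunk})
    return total

def iru_reorder(index_stream):
    """Reorder indices so consecutive threads receive consecutive data,
    and filter out duplicates, mimicking the IRU's reorder-and-merge."""
    return sorted(set(index_stream))

# Divergent index stream, e.g. neighbour IDs gathered during graph
# traversal; every index appears twice to model redundant accesses.
stream = [(i * 37) % 1024 for i in range(2048)]

baseline = total_transactions(stream)
reordered = total_transactions(iru_reorder(stream))
print(baseline, reordered)  # the reordered stream needs far fewer transactions
```

Sorting groups indices from the same cache line into the same warp, so each warp's gather coalesces into one transaction, while deduplication halves the number of elements processed; this is the software intuition for the coalescing and traffic reductions the abstract reports.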
Funder
H2020 European Research Council
Agencia Estatal de Investigación
Universitat Politècnica de Catalunya
Publisher
Springer Science and Business Media LLC
Subject
Hardware and Architecture, Information Systems, Theoretical Computer Science, Software
References: 49 articles.
Cited by
1 article.