Efficient Nearest-Neighbor Data Sharing in GPUs

Published:2021-01-21 Issue:1 Volume:18 Page:1-26
ISSN:1544-3566
Container-title:ACM Transactions on Architecture and Code Optimization
language:en
Short-container-title:ACM Trans. Archit. Code Optim.

Author:

Nematollahi Negin¹,Sadrosadati Mohammad²,Falahati Hajar²,Barkhordar Marzieh¹,Drumond Mario Paulo³,Sarbazi-Azad Hamid⁴,Falsafi Babak³

Affiliation:

1. Department of Computer Engineering, Sharif University of Technology, Iran

2. School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Iran

3. EPFL University, Switzerland

4. Department of Computer Engineering, Sharif University of Technology, Iran and School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Iran

Abstract

Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. Sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches and often incurs performance overheads due to the redundancy in memory accesses. In this article, we propose Neighbor Data (NeDa), a direct nearest-neighbor data sharing mechanism that uses two registers embedded in each streaming processor (SP) that can be accessed by nearest-neighbor SP cores. The registers are compiler-allocated and serve as a data exchange mechanism to eliminate nearest-neighbor shared accesses. NeDa is embedded carefully with local wires between SP cores so as to minimize the impact on density. We place and route NeDa in an open-source GPU and show a small area overhead of 1.3%. The cycle-accurate simulation indicates an average performance improvement of 21.8% and power reduction of up to 18.3% for stencil codes in General-Purpose Graphics Processing Unit (GPGPU) standard benchmark suites. We show that NeDa’s performance is within 13.2% of an ideal GPU with no overhead for nearest-neighbor data exchange.
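
The access pattern the abstract describes can be illustrated with a minimal sketch. It is not the paper's code and does not model NeDa or any GPU hardware. It assumes a 1D grid, a three-point averaging stencil, and boundary points that are copied unchanged; the function names `stencil_3pt` and `reads_per_element` are hypothetical. The second function counts how many output points read each input element, which is the overlap that becomes redundant memory traffic when each output point is computed by a separate thread.

```python
def stencil_3pt(grid):
    """Three-point average over a 1D grid (illustrative assumption, not from the paper).

    Each interior point becomes the mean of itself and its two nearest
    neighbors; the two boundary points are copied unchanged.
    """
    out = list(grid)
    for i in range(1, len(grid) - 1):
        out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out


def reads_per_element(n):
    """Count how many interior output points read each input element.

    If every output point were computed by its own thread, interior
    elements would be fetched by up to three neighboring threads.
    """
    reads = [0] * n
    for i in range(1, n - 1):
        for j in (i - 1, i, i + 1):
            reads[j] += 1
    return reads
```

For a five-point grid, `reads_per_element(5)` shows the middle element being read three times, once by each of three neighboring output points.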

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture,Information Systems,Software

Link

https://dl.acm.org/doi/pdf/10.1145/3429981

References: 78 articles.

1. Reducing Power Consumption of GPGPUs Through Instruction Reordering

2. Enabling GPGPU Low-Level Hardware Explorations with MIAOW

3. Exploring architectural heterogeneity in intelligent vision systems

Cited by 5 articles.

1. Accelerating multivariate functional approximation computation with domain decomposition techniques;Journal of Computational Science;2024-06

2. Comparison heuristics method for solving two echelon vehicle routing problem;AIP Conference Proceedings;2024

3. Snake: A Variable-length Chain-based Prefetching for GPUs;56th Annual IEEE/ACM International Symposium on Microarchitecture;2023-10-28

4. OSM: Off-Chip Shared Memory for GPUs;IEEE Transactions on Parallel and Distributed Systems;2022-12-01

5. NURA;Proceedings of the ACM on Measurement and Analysis of Computing Systems;2022-02-24