Memory access coalescing

Published:1994-06 Issue:6 Volume:29 Page:186-195
ISSN:0362-1340
Container-title:ACM SIGPLAN Notices
language:en
Short-container-title:SIGPLAN Not.

Author:

Davidson Jack W.¹, Jinturkar Sanjay¹

Affiliation:

1. Department of Computer Science, Thornton Hall, University of Virginia, Charlottesville, VA, U.S.A.

Abstract

As microprocessor speeds increase, memory bandwidth is increasingly the performance bottleneck for microprocessors. This has occurred because innovation and technological improvements in processor design have outpaced advances in memory design. Most attempts at addressing this problem have involved hardware solutions. Unfortunately, these solutions do little to help the situation with respect to current microprocessors. In previous work, we developed, implemented, and evaluated an algorithm that exploited the ability of newer machines with wide buses to load/store multiple floating-point operands in a single memory reference. This paper describes a general code improvement algorithm that transforms code to better exploit the available memory bandwidth on existing microprocessors as well as wide-bus machines. Where possible and advantageous, the algorithm coalesces narrow memory references into wide ones. An interesting characteristic of the algorithm is that some decisions about the applicability of the transformation are made at run time. This dynamic analysis significantly increases the probability of the transformation being applied. The code improvement transformation was implemented and added to the repertoire of code improvements of an existing retargetable optimizing back end. Using three current architectures as evaluation platforms, the effectiveness of the transformation was measured on a set of compute- and memory-intensive programs. Interestingly, the effectiveness of the transformation varied significantly with respect to the instruction-set architecture of the tested platform. For one of the tested architectures, improvements in execution speed ranging from 5 to 40 percent were observed. For another, the improvements in execution speed ranged from 5 to 20 percent, while for yet another, the transformation resulted in slower code for all programs.
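The idea of coalescing narrow memory references into wide ones, guarded by a decision made at run time, can be illustrated at the source level. The sketch below is not the authors' implementation (theirs is a compiler transformation inside an optimizing back end); it is a hand-written C analogue under stated assumptions. The function names `sum_narrow` and `sum_coalesced` are hypothetical, the element width (16 bits) and wide width (64 bits) are chosen only for illustration, and the alignment test is assumed here as one plausible example of a run-time applicability check.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: one narrow (16-bit) memory reference per element. */
uint32_t sum_narrow(const uint16_t *a, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Coalesced variant: one wide (64-bit) reference fetches four adjacent
 * 16-bit elements, which are then extracted with shifts.
 * Assumption: alignment stands in for the kind of run-time check the
 * abstract mentions; the narrow loop is the fallback when it fails. */
uint32_t sum_coalesced(const uint16_t *a, size_t n) {
    /* Run-time applicability check: take the wide path only when the
     * base address is aligned to the wide access size. */
    if (((uintptr_t)a % sizeof(uint64_t)) != 0)
        return sum_narrow(a, n);

    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint64_t wide;
        /* memcpy avoids strict-aliasing problems; compilers typically
         * lower it to a single 64-bit load. */
        memcpy(&wide, &a[i], sizeof wide);
        /* Summing all four lanes is independent of byte order. */
        sum += (uint16_t)(wide);
        sum += (uint16_t)(wide >> 16);
        sum += (uint16_t)(wide >> 32);
        sum += (uint16_t)(wide >> 48);
    }
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```

Both functions return the same value for any input; whether the wide version is faster depends on the cost of the extra extraction instructions relative to the saved loads, which is consistent with the abstract's observation that results varied by architecture.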

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design, Software

Link

https://dl.acm.org/doi/pdf/10.1145/773473.178259

References: 22 articles.

1. Code generation for streaming: an access/execute mechanism

2. A portable global optimizer and linker

Cited by 19 articles.

1. Seamless GPU Acceleration for C++-Based Physics with the Metal Shading Language on Apple’s M Series Unified Chips;Seismological Research Letters;2023-02-06

2. Intermediate Representations for Explicitly Parallel Programs;ACM Computing Surveys;2022-06-30

3. Triton Join: Efficiently Scaling to a Large Join State on GPUs with Fast Interconnects;Proceedings of the 2022 International Conference on Management of Data;2022-06-10

4. Triton: an intermediate language and compiler for tiled neural network computations;Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages - MAPL 2019;2019

5. DyCache: Dynamic Multi-Grain Cache Management for Irregular Memory Accesses on GPU;IEEE Access;2018