Affiliation:
1. University of Cambridge, UK
Abstract
Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting solution is software prefetching, where special non-blocking loads bring data into the cache hierarchy shortly before it is required. However, such prefetches are difficult to insert in a way that actually improves performance, and techniques for automatic insertion are currently limited.
This article develops a novel compiler pass to automatically generate software prefetches for indirect memory accesses, a special class of irregular memory accesses often seen in high-performance workloads. We evaluate this across a wide set of systems, all of which benefit from the technique. We then evaluate the extent to which good prefetch instructions are architecture dependent and the class of programs that are particularly amenable. Across a set of memory-bound benchmarks, our automated pass achieves average speedups of 1.3× for an Intel Haswell processor, 1.1× for both an ARM Cortex-A57 and Qualcomm Kryo, 1.2× for a Cortex-A72 and an Intel Kaby Lake, and 1.35× for an Intel Xeon Phi Knights Landing, each of which is an out-of-order core, and performance improvements of 2.1× and 2.7× for the in-order ARM Cortex-A53 and first generation Intel Xeon Phi.
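To make the idea of a software prefetch for an indirect memory access concrete, the following is a minimal hand-written sketch in C. It is not the output of the compiler pass described above. The function name `indirect_sum`, the array names, and the look-ahead distance `LOOKAHEAD` are all assumptions chosen for illustration. `__builtin_prefetch` is the GCC/Clang builtin that emits a non-blocking prefetch hint.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical look-ahead distance in loop iterations.
 * A suitable value depends on the machine and would need tuning. */
#define LOOKAHEAD 64

/* Sums values[index[i]] for i in [0, n): an indirect access pattern,
 * where the address of each load depends on data loaded from index[]. */
int64_t indirect_sum(const int32_t *values, const uint32_t *index, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Prefetch the sequentially accessed index array further ahead,
         * so that index[i + LOOKAHEAD] is likely cached when read below. */
        if (i + 2 * LOOKAHEAD < n)
            __builtin_prefetch(&index[i + 2 * LOOKAHEAD], 0, 3);

        /* Prefetch the indirect target. The bounds check is required
         * because index[i + LOOKAHEAD] is a real load, not a hint. */
        if (i + LOOKAHEAD < n)
            __builtin_prefetch(&values[index[i + LOOKAHEAD]], 0, 3);

        sum += values[index[i]];
    }
    return sum;
}
```

The prefetches do not change the result, only the time at which cache lines are requested. Whether this helps, and which look-ahead distance works best, is expected to vary between in-order and out-of-order cores, which is the architecture dependence the abstract refers to.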
Funder
ARM Ltd
Engineering and Physical Sciences Research Council
Publisher
Association for Computing Machinery (ACM)
Cited by
14 articles.
1. A prefetching indexing scheme for in-memory database systems;Future Generation Computer Systems;2024-07
2. Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses;ACM Transactions on Architecture and Code Optimization;2024-03-23
3. Decoupled Vector Runahead;56th Annual IEEE/ACM International Symposium on Microarchitecture;2023-10-28
4. SaGraph: A Similarity-aware Hardware Accelerator for Temporal Graph Processing;2023 60th ACM/IEEE Design Automation Conference (DAC);2023-07-09
5. RISC-V-Based Evaluation and Strategy Exploration of MRAM Triple-Level Hybrid Cache Systems;IEEE Transactions on Very Large Scale Integration (VLSI) Systems;2023-07