Abstract
GPGPUs have gradually become a mainstream acceleration component in high-performance computing, but the long latency of memory operations remains a bottleneck for GPU performance. On a GPU, threads are grouped into warps for scheduling and execution. The L1 data cache has a small capacity, and many warps share it, so the cache suffers heavy contention and frequent pipeline stalls. We propose Locality-Based Cache Management (LCM), combined with Locality-Based Warp Scheduling (LWS), to reduce cache contention and improve GPU performance. Each load instruction falls into one of three locality types: streaming data, used only once; intra-warp locality, accessed multiple times by the same warp; and inter-warp locality, accessed by different warps. According to the locality type of a load instruction, LCM bypasses the cache for streaming requests to improve cache utilization and extends inter-warp memory request coalescing to exploit inter-warp locality, while LWS schedules warps to alleviate the remaining cache contention. Together, LCM and LWS effectively improve cache performance and thereby overall GPU performance. In our experimental evaluation, LCM and LWS achieve an average performance improvement of 26% over the baseline GPU.
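The three-way locality classification described in the abstract can be illustrated with a small simulation sketch. This is not the paper's implementation; the trace format, the `classify_loads` helper, and the precedence rule (inter-warp over intra-warp over streaming) are all assumptions made for illustration, based only on the definitions given above.

```python
from collections import defaultdict

def classify_loads(trace):
    """Classify each load PC by the reuse pattern of the cache lines it touches.

    trace: iterable of (warp_id, pc, cache_line) tuples (a hypothetical format).
    Returns {pc: 'streaming' | 'intra-warp' | 'inter-warp'}.
    """
    line_warps = defaultdict(set)   # cache_line -> set of warps that touched it
    line_count = defaultdict(int)   # cache_line -> total access count
    pc_lines = defaultdict(set)     # pc -> cache lines loaded by that instruction

    for warp_id, pc, line in trace:
        line_warps[line].add(warp_id)
        line_count[line] += 1
        pc_lines[pc].add(line)

    classes = {}
    for pc, lines in pc_lines.items():
        if any(len(line_warps[l]) > 1 for l in lines):
            classes[pc] = 'inter-warp'   # same line touched by different warps
        elif any(line_count[l] > 1 for l in lines):
            classes[pc] = 'intra-warp'   # line reused, but only within one warp
        else:
            classes[pc] = 'streaming'    # every line touched exactly once
    return classes

# Toy trace: (warp_id, pc, cache_line)
trace = [
    (0, 0x100, 0xA), (0, 0x100, 0xB),   # PC 0x100: each line used once
    (0, 0x200, 0xC), (0, 0x200, 0xC),   # PC 0x200: line reused within warp 0
    (0, 0x300, 0xD), (1, 0x300, 0xD),   # PC 0x300: line shared by warps 0 and 1
]
print(classify_loads(trace))
```

A cache controller could then bypass the L1 for PCs classified as streaming and prioritize coalescing for inter-warp PCs, which is the policy direction the abstract describes.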
Funder
National Natural Science Foundation of China
Beijing Natural Science Foundation
Subject
Electrical and Electronic Engineering, Mechanical Engineering, Control and Systems Engineering
Cited by
4 articles.
1. Architecture-Aware Currying;2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT);2023-10-21
2. LFWS: Long-Operation First Warp Scheduling Algorithm to Effectively Hide the Latency for GPUs;IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences;2023-08-01
3. L2 Cache Access Pattern Analysis using Static Profiling of an Application;2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC);2023-06
4. Criticality-aware priority to accelerate GPU memory access;The Journal of Supercomputing;2022-07-06