MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Author:

Rachata Ausavarungnirun 1, Vance Miller 2, Joshua Landgraf 2, Saugata Ghose 1, Jayneel Gandhi 3, Adwait Jog 4, Christopher J. Rossbach 5, Onur Mutlu 6

Affiliation:

1. Carnegie Mellon University, Pittsburgh, PA, USA

2. University of Texas at Austin, Austin, TX, USA

3. VMware Research, Palo Alto, CA, USA

4. College of William and Mary, Williamsburg, VA, USA

5. University of Texas at Austin & VMware Research, Austin, TX, USA

6. ETH Zürich & Carnegie Mellon University, Zürich, Switzerland

Abstract

Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK's system throughput is within 23.2% of an ideal GPU system with no address translation overhead.
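To make mechanism (1) concrete, the following is a minimal software sketch of a token-based TLB-fill policy in the spirit of the one the abstract describes; it is not the paper's hardware design, and all class and function names (SharedTLB, TokenPolicy, translate) are illustrative assumptions. Each application receives a token budget, and only translation requests holding a token may fill the shared TLB; the rest bypass it, limiting inter-application contention for TLB capacity.

```python
class SharedTLB:
    """A shared last-level TLB modeled as a bounded map of (app, vpage) -> ppage."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}

    def lookup(self, app_id, vpage):
        return self.entries.get((app_id, vpage))

    def fill(self, app_id, vpage, ppage):
        if len(self.entries) >= self.capacity:
            # Naive FIFO eviction stands in for the real replacement policy.
            self.entries.pop(next(iter(self.entries)))
        self.entries[(app_id, vpage)] = ppage


class TokenPolicy:
    """Grant each application a fixed number of TLB-fill tokens per epoch."""
    def __init__(self, budgets):
        self.tokens = dict(budgets)  # app_id -> remaining fill tokens

    def may_fill(self, app_id):
        if self.tokens.get(app_id, 0) > 0:
            self.tokens[app_id] -= 1
            return True
        return False  # out of tokens: the fill is skipped (bypass)


def translate(tlb, policy, page_table, app_id, vpage):
    """Translate one virtual page, filling the shared TLB only if a token is held."""
    ppage = tlb.lookup(app_id, vpage)
    if ppage is not None:
        return ppage, "hit"
    ppage = page_table[(app_id, vpage)]   # stands in for a long-latency page walk
    if policy.may_fill(app_id):
        tlb.fill(app_id, vpage, ppage)
        return ppage, "miss+fill"
    return ppage, "miss+bypass"
```

In this toy model, an application with tokens caches its translations and hits on reuse, while an application without tokens still translates correctly but cannot displace other applications' entries; the real mechanism additionally adapts token counts at runtime.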

Funder

NSF

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design, Software


Cited by 16 articles.

1. Towards Deterministic End-to-end Latency for Medical AI Systems in NVIDIA Holoscan;2024 ACM/IEEE 15th International Conference on Cyber-Physical Systems (ICCPS);2024-05-13

2. Enabling Efficient Spatio-Temporal GPU Sharing for Network Function Virtualization;IEEE Transactions on Computers;2023-10

3. GPU Performance Acceleration via Intra-Group Sharing TLB;Proceedings of the 52nd International Conference on Parallel Processing;2023-08-07

4. Operand-Oriented Virtual Memory Support for Near-Memory Processing;IEEE Transactions on Computers;2023-08-01

5. KRISP: Enabling Kernel-wise RIght-sizing for Spatial Partitioned GPU Inference Servers;2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA);2023-02
