Affiliation:
1. Rutgers University, Piscataway, NJ, USA
2. Qualcomm Research, Raleigh, NC, USA
Abstract
The proliferation of heterogeneous compute platforms, of which CPU/GPU is a prevalent example, necessitates a manageable programming model to ensure widespread adoption. A key component of this is a shared unified address space between the heterogeneous units to obtain the programmability benefits of virtual memory.
To this end, we are the first to explore GPU Memory Management Units (MMUs) consisting of Translation Lookaside Buffers (TLBs) and page table walkers (PTWs) for address translation in unified heterogeneous systems. We show the performance challenges posed by GPU warp schedulers on TLBs accessed in parallel with L1 caches, which provide many well-known programmability benefits. In response, we propose modest TLB and PTW augmentations that recover most of the performance lost by introducing L1 parallel TLB access. We also show that a little TLB-awareness can make other GPU performance enhancements (e.g., cache-conscious warp scheduling and dynamic warp formation on branch divergence) feasible in the face of cache-parallel address translation, bringing overheads into the range deemed acceptable for CPUs (10-15% of runtime). We presume this initial design leaves room for improvement but anticipate that our bigger insight, that a little TLB-awareness goes a long way in GPUs, will spur further work in this fruitful area.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software
Cited by
3 articles.
1. Reducing TLB Miss Penalty on GPUs via Unified Multi-level PWB and PWC;2021 12th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP);2021-12-10
2. Translation ranger;Proceedings of the 46th International Symposium on Computer Architecture;2019-06-22
3. Efficient Virtual Memory Sharing via On-Accelerator Page Table Walking in Heterogeneous Embedded SoCs;ACM Transactions on Embedded Computing Systems;2017-10-31