Abstract
Translation Lookaside Buffers (TLBs) are critical to overall system performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as Chip MultiProcessors (CMPs) become ubiquitous, TLB design and performance must be reevaluated. Our article begins by performing a thorough TLB performance evaluation of sequential and parallel benchmarks running on a real-world, modern CMP system using hardware performance counters. This analysis demonstrates the need for further improvement of TLB hit rates for both classes of application, and it also points out that the data TLB has a significantly higher miss rate than the instruction TLB in both cases.
In response to the characterization data, we propose and evaluate both Inter-Core Cooperative (ICC) TLB prefetchers and Shared Last-Level (SLL) TLBs as alternatives to the commercial norm of private, per-core L2 TLBs. ICC prefetchers eliminate 19% to 90% of Data TLB (D-TLB) misses across parallel workloads while requiring only modest changes in hardware. SLL TLBs eliminate 7% to 79% of D-TLB misses for parallel workloads and 35% to 95% of D-TLB misses for multiprogrammed sequential workloads. These reductions correspond to 27% and 21% increases in hit rates, respectively, compared to private, per-core L2 TLBs, and are achieved with even more modest hardware requirements.
Because of their benefits for parallel applications, their applicability to sequential workloads, and their readily implementable hardware, SLL TLBs and ICC TLB prefetchers hold great promise for CMPs.
Funder
Gigascale Systems Research Center
Division of Computer and Network Systems
Focus Center Research Program
Intel Corporation
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
Cited by
57 articles.