TLB Improvements for Chip Multiprocessors

Author:

Lustig Daniel (1), Bhattacharjee Abhishek (2), Martonosi Margaret (1)

Affiliation:

1. Princeton University

2. Rutgers University

Abstract

Translation Lookaside Buffers (TLBs) are critical to overall system performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as Chip MultiProcessors (CMPs) become ubiquitous, TLB design and performance must be reevaluated. Our article begins by performing a thorough TLB performance evaluation of sequential and parallel benchmarks running on a real-world, modern CMP system using hardware performance counters. This analysis demonstrates the need for further improvement of TLB hit rates for both classes of application, and it also points out that the data TLB has a significantly higher miss rate than the instruction TLB in both cases. In response to the characterization data, we propose and evaluate both Inter-Core Cooperative (ICC) TLB prefetchers and Shared Last-Level (SLL) TLBs as alternatives to the commercial norm of private, per-core L2 TLBs. ICC prefetchers eliminate 19% to 90% of Data TLB (D-TLB) misses across parallel workloads while requiring only modest changes in hardware. SLL TLBs eliminate 7% to 79% of D-TLB misses for parallel workloads and 35% to 95% of D-TLB misses for multiprogrammed sequential workloads. This corresponds to 27% and 21% increases in hit rates as compared to private, per-core L2 TLBs, respectively, and is achieved using even more modest hardware requirements. Because of their benefits for parallel applications, their applicability to sequential workloads, and their readily implementable hardware, SLL TLBs and ICC TLB prefetchers hold great promise for CMPs.

Funder

Gigascale Systems Research Center

Division of Computer and Network Systems

Focus Center Research Program

Intel Corporation

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Cited by 57 articles.

1. How to Be Fast and Not Furious: Looking Under the Hood of CPU Cache Prefetching;Proceedings of the 20th International Workshop on Data Management on New Hardware;2024-06-09

2. WASP: Workload-Aware Self-Replicating Page-Tables for NUMA Servers;Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2;2024-04-27

3. Reconfigurable Virtual Memory for FPGA-Driven I/O;Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3;2023-03-25

4. Fine-grain data classification to filter token coherence traffic;Journal of Parallel and Distributed Computing;2023-01

5. Eager Memory Cryptography in Caches;2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO);2022-10
