Memory Row Reuse Distance and its Role in Optimizing Application Performance

Published: 2015-06-24 Issue: 1 Volume: 43 Pages: 137-149
ISSN: 0163-5999
Container-title: ACM SIGMETRICS Performance Evaluation Review
Language: en
Short-container-title: SIGMETRICS Perform. Eval. Rev.

Author:

Kandemir Mahmut¹, Zhao Hui¹, Tang Xulong¹, Karakoy Mustafa²

Affiliation:

1. The Pennsylvania State University, University Park, PA, USA

2. TOBB ETU, Ankara, Turkey

Abstract

Continuously increasing dataset sizes of large-scale applications overwhelm on-chip cache capacities and make the performance of last-level caches (LLC) increasingly important. That is, in addition to maximizing LLC hit rates, it is becoming equally important to reduce LLC miss latencies. One of the critical factors that influence LLC miss latencies is row-buffer locality (i.e., the fraction of LLC misses that hit in the large buffer attached to a memory bank). While there has been a plethora of recent works on optimizing row-buffer performance, to our knowledge, there is no study that quantifies the full potential of row-buffer locality and impact of maximizing it on application performance. Focusing on multithreaded applications, the first contribution of this paper is the definition of a new metric called (memory) row reuse distance (RRD). We show that, while intra-core RRDs are relatively small (increasing the chances for row-buffer hits), inter-core RRDs are quite large (increasing the chances for row-buffer misses). Motivated by this, we propose two schemes that measure the maximum potential benefits that could be obtained from minimizing RRDs, to the extent allowed by program dependencies. Specifically, one of our schemes (Scheme-I) targets only intra-core RRDs, whereas the other one (Scheme-II) aims at reducing both intra-core RRDs and inter-core RRDs. Our experimental evaluations demonstrate that (i) Scheme-I reduces intra-core RRDs but increases inter-core RRDs; (ii) Scheme-II reduces inter-core RRDs significantly while achieving a similar behavior to Scheme-I as far as intra-core RRDs are concerned; (iii) Scheme-I and Scheme-II improve execution times of our applications by 17% and 21%, respectively, on average; and (iv) both our schemes deliver consistently good results under different memory request scheduling policies.
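The abstract does not give the formal definition of RRD, so the following is only a minimal sketch under an assumed definition: the distance between two successive accesses to the same row of a bank is taken to be the number of intervening accesses to that bank that target other rows. A reuse is counted as intra-core when both accesses come from the same core and as inter-core otherwise. The function name `row_reuse_distances` and the `(core, bank, row)` trace format are hypothetical and chosen for illustration; the paper's actual metric may differ.

```python
from collections import defaultdict


def row_reuse_distances(trace):
    """Split row reuse distances into intra-core and inter-core lists.

    trace: iterable of (core, bank, row) tuples in issue order.
    Assumed definition (not taken from the paper): the distance is the
    number of accesses to the same bank, to other rows, that occur
    between two successive accesses to the same row.
    """
    bank_count = defaultdict(int)  # accesses seen so far, per bank
    last_seen = {}                 # (bank, row) -> (core, position within bank)
    intra, inter = [], []
    for core, bank, row in trace:
        pos = bank_count[bank]
        key = (bank, row)
        if key in last_seen:
            prev_core, prev_pos = last_seen[key]
            # Every access to this bank since prev_pos went to another row,
            # because last_seen is refreshed on each access to this row.
            distance = pos - prev_pos - 1
            (intra if prev_core == core else inter).append(distance)
        last_seen[key] = (core, pos)
        bank_count[bank] = pos + 1
    return intra, inter


# Hypothetical trace: core 0 reuses row 5 of bank 0 with one access
# from core 1 (to row 7) in between.
example = [(0, 0, 5), (1, 0, 7), (0, 0, 5)]
intra, inter = row_reuse_distances(example)  # intra == [1], inter == []
```

Under this assumed definition, a distance of 0 corresponds to back-to-back accesses to the same row, which is the case most likely to hit in an open row buffer; larger distances mean the row is more likely to have been replaced before it is reused.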

Funder

NSF

Intel Inc.

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture, Software

Link

https://dl.acm.org/doi/pdf/10.1145/2796314.2745867

References: 43 articles.

1. Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning

2. The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost

3. Compiler Support for Optimizing Memory Bank-Level Parallelism

4. Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture

Cited by 1 article.

1. Morton filters: fast, compressed sparse cuckoo filters; The VLDB Journal; 2019-08-06