DeFT

Published:2011-07 Issue:2 Volume:8 Page:1-27
ISSN:1544-3566
Container-title:ACM Transactions on Architecture and Code Optimization
language:en
Short-container-title:ACM Trans. Archit. Code Optim.

Author:

Venkataramani Guru¹,Hughes Christopher J.²,Kumar Sanjeev³,Prvulovic Milos⁴

Affiliation:

1. The George Washington University, Washington, DC

2. Intel Corporation

3. Facebook Inc.

4. Georgia Institute of Technology

Abstract

While multicore processors promise large performance benefits for parallel applications, writing these applications is notoriously difficult. Tuning a parallel application to achieve good performance, also known as performance debugging, is often more challenging than debugging the application for correctness. Parallel programs have many performance-related issues that are not seen in sequential programs. An increase in cache misses is one of the biggest challenges that programmers face. To minimize these misses, programmers must not only identify the source of the extra misses, but also perform the tricky task of determining if the misses are caused by interthread communication (i.e., coherence misses) and if so, whether they are caused by true or false sharing (since the solutions for these two are quite different). In this article, we propose a new programmer-centric definition of false sharing misses and describe our novel algorithm to perform coherence miss classification. We contrast our approach with existing data-centric definitions of false sharing. A straightforward implementation of our algorithm is too expensive to be incorporated in real hardware. Therefore, we explore the design space for low-cost hardware support that can classify coherence misses on-the-fly into true and false sharing misses, allowing existing performance counters and profiling tools to expose and attribute them. We find that our approximate schemes achieve good accuracy at only a fraction of the cost of the ideal scheme. Additionally, we demonstrate the usefulness of our work in a case study involving a real application.
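The distinction between true and false sharing that the abstract relies on can be illustrated with a small trace-based sketch. This is not the programmer-centric definition or the algorithm proposed in the article, neither of which is given in the abstract. It is a simplified word-granularity classifier under stated assumptions: an invalidation-based protocol, an infinite cache (so there are no capacity or conflict misses), a hypothetical block size `WORDS_PER_BLOCK`, and classification based only on the word that triggers the miss. The function name `classify_coherence_misses` and the trace format are likewise invented for illustration.

```python
from collections import defaultdict

WORDS_PER_BLOCK = 8  # assumed cache block size in words (illustrative)


def classify_coherence_misses(trace):
    """Classify coherence misses in a trace of (thread, op, word_address).

    op is "R" or "W". Returns a list of (trace_index, thread, kind), where
    kind is "true_sharing" or "false_sharing". Simplification: a miss is
    judged only by the word that triggers it, not by later accesses.
    """
    valid = defaultdict(bool)  # (thread, block) -> thread holds a valid copy
    stale = defaultdict(set)   # (thread, block) -> words written remotely since invalidation
    cached_before = set()      # (thread, block) pairs that have held the block
    results = []

    for i, (thread, op, addr) in enumerate(trace):
        block = addr // WORDS_PER_BLOCK
        key = (thread, block)

        if not valid[key]:
            # A miss on a block this thread held before is a coherence miss,
            # because the only way to lose a copy in this model is invalidation.
            if key in cached_before:
                kind = "true_sharing" if addr in stale[key] else "false_sharing"
                results.append((i, thread, kind))
            valid[key] = True
            cached_before.add(key)
            stale[key].clear()

        if op == "W":
            # A write invalidates every other thread's copy of the block and
            # records which word changed.
            for other in list(cached_before):
                other_thread, other_block = other
                if other_block == block and other_thread != thread:
                    valid[other] = False
                    stale[other].add(addr)

    return results


example_trace = [
    (0, "R", 0),  # thread 0: cold miss on block 0
    (1, "W", 1),  # thread 1 writes word 1, invalidating thread 0's copy
    (0, "R", 0),  # coherence miss; word 0 was not written remotely
    (1, "W", 0),  # thread 1 writes word 0, invalidating thread 0 again
    (0, "R", 0),  # coherence miss; word 0 was written remotely
]
print(classify_coherence_misses(example_trace))
# → [(2, 0, 'false_sharing'), (4, 0, 'true_sharing')]
```

In this sketch the two kinds of miss call for different fixes, as the abstract notes: the false sharing miss would disappear if words 0 and 1 were placed in different blocks, whereas the true sharing miss reflects real communication between the threads.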

Funder

National Science Foundation

Semiconductor Research Corporation

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture,Information Systems,Software

Link

https://dl.acm.org/doi/pdf/10.1145/1970386.1970389

Reference: 29 articles.

1. Parallelization Made Easier with Intel Performance Tuning Utility

2. Bianchini R. and Kontothanassis L. 1995. Algorithms for categorizing multiprocessor communication under invalidate and update-based coherence protocols. Tech. rep. 533, University of Rochester.