Affiliation:
1. North Carolina State University, Raleigh, NC
2. Lawrence Livermore National Laboratory, Livermore, CA
Abstract
Application performance on high-performance shared-memory systems is often limited by sharing patterns that result in cache-coherence bottlenecks. Current approaches to identifying coherence bottlenecks incur considerable run-time overhead and do not scale. We present two novel hardware-assisted coherence-analysis techniques that reduce trace sizes by two orders of magnitude compared with full traces. First, hardware performance monitoring is combined with capturing stores in software to provide a lossy-trace mechanism, which is an order of magnitude faster than software-instrumentation-based full tracing and retains accuracy. Second, selected long-latency loads are instrumented via binary rewriting, which provides even higher accuracy and control over tracing, but incurs additional overhead.
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
Cited by
8 articles.