Abstract
Graphics Processing Units (GPUs) provide highly efficient parallel execution for high-performance computing and embedded-system domains. While performance concerns drive most optimization efforts, power consumption becomes increasingly important for energy-efficient GPU execution. Although performance profilers and architectural simulators offer statistics about the target execution, they either report performance metrics only at the coarse granularity of kernel functions or lack the visualization support needed for bottleneck analysis and performance-power comparison. Evaluating both performance and power consumption dynamically at runtime and across GPU memory components enables a comprehensive trade-off analysis for GPU architects and software developers. This paper presents GPPRMon, a novel memory performance and power monitoring tool for GPU programs that performs systematic metric collection and offers visualization views for tracking performance and power optimizations. Our simulation-based framework dynamically collects microarchitectural metrics by monitoring individual instructions and reports the achieved performance and power consumption at runtime. Our visualization interface presents spatial and temporal views of the execution: the former displays the performance and power metrics across GPU memory components, while the latter shows the corresponding information at instruction granularity on a timeline. Our case study demonstrates the potential uses of our tool for bottleneck identification and power-consumption analysis on a memory-intensive graph workload.
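The abstract describes collecting metrics per instruction and aggregating them into a spatial view (per memory component) and a temporal view (a timeline). The sketch below illustrates one way such a collector could be organized. It is a minimal, hypothetical C++ example: the names (`InstrSample`, `MetricCollector`), the component labels, and the energy figures are all invented for illustration and do not reflect GPPRMon's actual implementation or any simulator's real API.

```cpp
// Hypothetical per-instruction metric collector, loosely inspired by the
// monitoring scheme the abstract describes. All names and numbers here are
// illustrative assumptions, not GPPRMon's actual interface.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// One record per monitored instruction: when it issued, which memory
// component it touched, and a modeled energy cost for the access.
struct InstrSample {
    uint64_t cycle;        // simulation cycle at issue
    std::string component; // e.g. "L1D", "L2", "DRAM" (assumed labels)
    double energy_nj;      // modeled energy for this access, in nanojoules
    bool hit;              // whether the access hit in this component
};

// Aggregated per-component statistics backing a spatial view.
struct ComponentStats {
    uint64_t accesses = 0;
    uint64_t hits = 0;
    double energy_nj = 0.0;
};

class MetricCollector {
public:
    void record(const InstrSample& s) {
        timeline_.push_back(s);        // temporal view: instruction-level trace
        auto& c = stats_[s.component]; // spatial view: per-component totals
        c.accesses++;
        c.hits += s.hit ? 1 : 0;
        c.energy_nj += s.energy_nj;
    }

    void report(std::ostream& os) const {
        for (const auto& [name, c] : stats_) {
            double hit_rate = c.accesses ? 100.0 * c.hits / c.accesses : 0.0;
            os << name << ": " << c.accesses << " accesses, "
               << hit_rate << "% hits, " << c.energy_nj << " nJ\n";
        }
    }

private:
    std::vector<InstrSample> timeline_;          // feeds the timeline view
    std::map<std::string, ComponentStats> stats_; // feeds the spatial view
};

int main() {
    MetricCollector mc;
    // Toy samples standing in for callbacks from a cycle-level simulator.
    mc.record({100, "L1D", 0.05, true});
    mc.record({101, "L2", 0.40, false});
    mc.record({180, "DRAM", 6.00, false});
    mc.report(std::cout);
}
```

Keeping the raw instruction trace separate from the aggregated per-component totals mirrors the paper's split between the temporal and spatial views: the trace preserves instruction-granularity timing, while the aggregates summarize behavior per memory component.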
Publisher
Springer Nature Switzerland