Refining HPCToolkit for application performance analysis at exascale

Published: 2024-08-30
ISSN: 1094-3420
Container-title: The International Journal of High Performance Computing Applications
Language: en

Author:

Adhianto Laksono¹, Anderson Jonathon¹, Barnett Robert Matthew¹, Grbic Dragana¹, Indic Vladimir², Krentel Mark¹, Liu Yumeng¹, Milaković Srđan¹, Phan Wileam¹, Mellor-Crummey John¹

Affiliation:

1. Department of Computer Science, Rice University, Houston, TX, USA

2. Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia

Abstract

As part of the US Department of Energy’s Exascale Computing Project (ECP), Rice University has been refining its HPCToolkit performance tools to better support measurement and analysis of applications executing on exascale supercomputers. To efficiently collect performance measurements of GPU-accelerated applications, HPCToolkit employs novel non-blocking data structures to communicate performance measurements between tool threads and application threads. To attribute performance information in detail to source lines, loop nests, and inlined call chains, HPCToolkit performs parallel analysis of large CPU and GPU binaries involved in the execution of an exascale application to rapidly recover mappings between machine instructions and source code. To analyze terabytes of performance measurements gathered during executions at exascale, HPCToolkit employs distributed-memory parallelism, multithreading, sparse data structures, and out-of-core streaming analysis algorithms. To support interactive exploration of profiles up to terabytes in size, HPCToolkit’s hpcviewer graphical user interface uses out-of-core methods to visualize performance data. The result of these efforts is that HPCToolkit now supports collection, analysis, and presentation of profiles and traces of GPU-accelerated applications at exascale. These improvements have enabled HPCToolkit to efficiently measure, analyze and explore terabytes of performance data for executions using as many as 64K MPI ranks and 64K GPU tiles on ORNL’s Frontier supercomputer. HPCToolkit’s support for measurement and analysis of GPU-accelerated applications has been employed to study a collection of open-science applications developed as part of ECP. This paper reports on these experiences, which provided insight into opportunities for tuning applications, strengths and weaknesses of HPCToolkit itself, as well as unexpected behaviors in executions at exascale.

Funder

Lawrence Livermore National Laboratory

Argonne National Laboratory

Office of Science

TotalEnergies E&P Research & Technology USA, LLC

Advanced Micro Devices

National Nuclear Security Administration

Intel Corporation

Publisher

SAGE Publications

Link

https://journals.sagepub.com/doi/pdf/10.1177/10943420241277839
