Affiliation
1. Lawrence Livermore National Laboratory, USA
2. Jülich Supercomputing Centre, Germany
3. RWTH Aachen University, Germany
4. Technische Universität Darmstadt, Germany
Abstract
Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, as delays of single processes may spread wait states across the entire machine. Moreover, when employing complex point-to-point communication patterns, wait states may propagate along far-reaching cause-effect chains that are hard to track manually and that complicate an assessment of the actual costs of an imbalance. Building on earlier work by Meira, Jr., et al., we present a scalable approach that identifies program wait states and attributes their costs in terms of resource waste to their original cause. By replaying event traces in parallel both forward and backward, we can identify the processes and call paths responsible for the most severe imbalances, even for runs with hundreds of thousands of processes.
Funder
G8 Research Councils Initiative on Multilateral Research
Deutsche Forschungsgemeinschaft
U.S. Department of Energy by Lawrence Livermore National Laboratory
Interdisciplinary Program on Application Software towards Exascale Computing for Global Scale Issues
Helmholtz Association of German Research Centers
Publisher
Association for Computing Machinery (ACM)
Subject
Computational Theory and Mathematics, Computer Science Applications, Hardware and Architecture, Modeling and Simulation, Software
Cited by
15 articles.