Abstract
Creating efficient, scalable dynamic parallel runtime systems for chip multiprocessors (CMPs) requires understanding the overheads that manifest at high core counts and small task sizes.
In this article, we assess these overheads in Intel's Threading Building Blocks (TBB) and OpenMP. First, we use real hardware and simulations to characterize various scheduler and synchronization overheads, and we find that they can amount to 47% of benchmark runtime under TBB and 80% under OpenMP. Second, we propose load balancing techniques, such as occupancy-based and criticality-guided task stealing, to boost performance; a brief sketch of the occupancy-based idea follows the abstract.
Overall, our study provides valuable insights for creating robust, scalable runtime libraries.
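To make the occupancy-based task stealing idea concrete, the C++ sketch below shows one plausible form of it: an idle worker picks the victim whose deque currently holds the most tasks instead of a random one. This is only an illustration under assumed structures; the names (Worker, pick_victim, try_steal) are hypothetical and are not TBB or OpenMP internals, nor the authors' exact mechanism.

```cpp
// Sketch of occupancy-based victim selection for work stealing.
// All names here are illustrative, not part of TBB or OpenMP.
#include <atomic>
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

struct Worker {
    std::deque<std::function<void()>> tasks;  // local task deque
    std::mutex lock;                          // guards the deque
    std::atomic<size_t> occupancy{0};         // cached task count, readable without the lock
};

// Prefer the worker whose deque holds the most tasks; at high core
// counts this tends to reduce repeated failed steals from empty deques.
int pick_victim(const std::vector<Worker>& workers, int self) {
    int best = -1;
    size_t best_occ = 0;
    for (int i = 0; i < static_cast<int>(workers.size()); ++i) {
        if (i == self) continue;
        size_t occ = workers[i].occupancy.load(std::memory_order_relaxed);
        if (occ > best_occ) { best_occ = occ; best = i; }
    }
    return best;  // -1 means no work is visible anywhere
}

bool try_steal(std::vector<Worker>& workers, int self,
               std::function<void()>& out) {
    int victim = pick_victim(workers, self);
    if (victim < 0) return false;
    Worker& v = workers[victim];
    std::lock_guard<std::mutex> g(v.lock);
    if (v.tasks.empty()) return false;        // occupancy hint was stale
    out = std::move(v.tasks.front());         // steal from the "cold" end of the deque
    v.tasks.pop_front();
    v.occupancy.fetch_sub(1, std::memory_order_relaxed);
    return true;
}
```

A criticality-guided variant would rank victims by a thread criticality estimate (e.g., from hardware or runtime counters) rather than by queue occupancy, but follows the same victim-selection structure.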
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
References
39 articles.
Cited by
11 articles.