Affiliation:
1. ETH Zurich, Switzerland
Abstract
Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA), in which accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today’s programming languages and libraries have no explicit support for NUMA systems, (2) NUMA optimizations are not portable, and (3) optimizations are not composable (i.e., they can become ineffective or worsen performance in environments that support composable parallel software).
This article presents TBB-NUMA, a parallel programming library based on Intel Threading Building Blocks (TBB) that supports portable and composable NUMA-aware programming. TBB-NUMA provides a model of task affinity that captures a programmer’s insights on mapping tasks to resources. NUMA-awareness affects all layers of the library (i.e., resource management, task scheduling, and high-level parallel algorithm templates) and requires close coupling between all these layers. Optimizations implemented with TBB-NUMA (for a set of standard benchmark programs) result in up to 44% performance improvement over standard TBB. More importantly, optimized programs are portable across different NUMA architectures and preserve data locality even when composed with other parallel computations sharing the same resource management layer.
Publisher
Association for Computing Machinery (ACM)
Subject
Computational Theory and Mathematics, Computer Science Applications, Hardware and Architecture, Modeling and Simulation, Software
Cited by
8 articles.
1. WASP: Workload-Aware Self-Replicating Page-Tables for NUMA Servers;Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2;2024-04-27
2. Online Thread and Data Mapping Using a Sharing-Aware Memory Management Unit;ACM Transactions on Modeling and Performance Evaluation of Computing Systems;2020-12-31
3. Bandwidth-Aware Page Placement in NUMA;2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS);2020-05
4. Mozart: Efficient Composition of Library Functions for Heterogeneous Execution;Languages and Compilers for Parallel Computing;2019
5. Extending NUMA-BTLP Algorithm with Thread Mapping Based on a Communication Tree;Computers;2018-12-03