Affiliation:
1. KTH, Sweden
2. Pacific Northwest National Laboratory, USA
3. University of California at San Diego, USA
Abstract
Traditional scientific and emerging data analytics applications require fast, power-efficient, large, and persistent memories. Combining all these characteristics within a single memory technology is expensive and hence future supercomputers will feature different memory technologies side-by-side. However, it is a complex task to program hybrid-memory systems and to identify the best object-to-memory mapping. We envision that programmers will probably resort to using default configurations that require only minimal interventions on the application code or system settings. In this work, we argue that intelligent, fine-grained data placement can achieve higher performance than default setups.
We present an algorithm for data placement on hybrid-memory systems. Our algorithm is based on a set of single-object allocation rules and global data placement decisions. We also present RTHMS, a tool that implements our algorithm and provides recommendations about the object-to-memory mapping. Our experiments on a hybrid memory system, an Intel Knights Landing processor with DRAM and HBM, show that RTHMS is able to achieve higher performance than the default configuration. We believe that RTHMS will be a valuable tool for programmers working on complex hybrid-memory systems.
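The abstract describes the algorithm only at a high level, as single-object allocation rules combined with global data placement decisions. A minimal sketch of that two-stage idea is given below. It is not the RTHMS implementation: the `MemObject` fields, the 0.5 threshold in `prefers_hbm`, and the greedy score-per-megabyte ordering in `place_objects` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemObject:
    """Hypothetical per-object profile (not the RTHMS data model)."""
    name: str
    size_mb: int              # allocation size in MB, assumed > 0
    bandwidth_score: float    # assumed metric in [0, 1]; higher = more bandwidth-bound
    latency_sensitive: bool   # assumed flag, e.g. irregular pointer-chasing access


def prefers_hbm(obj: MemObject) -> bool:
    """Single-object rule (illustrative threshold, not taken from the paper)."""
    return obj.bandwidth_score > 0.5 and not obj.latency_sensitive


def place_objects(objects: list[MemObject], hbm_capacity_mb: int) -> dict[str, str]:
    """Global decision: fit the HBM candidates into limited HBM capacity.

    Candidates are taken greedily by score per MB; everything else
    stays in DRAM, which is treated here as the default memory.
    """
    placement = {obj.name: "DRAM" for obj in objects}
    candidates = [obj for obj in objects if prefers_hbm(obj)]
    candidates.sort(key=lambda o: o.bandwidth_score / o.size_mb, reverse=True)

    free_mb = hbm_capacity_mb
    for obj in candidates:
        if obj.size_mb <= free_mb:
            placement[obj.name] = "HBM"
            free_mb -= obj.size_mb
    return placement
```

The second stage matters because the per-object rule alone can select more data than the HBM can hold, so some objects that individually prefer HBM still end up in DRAM.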
Funder
the DOE Office of Science Advanced Scientific Computing Research through the ARGO project
the DOE Office of Science Advanced Scientific Computing Research through the CENATE project
the European Commission through the SAGE project
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software
Cited by
28 articles.
1. Flexible and Effective Object Tiering for Heterogeneous Memory Systems;Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management;2023-06-06
2. Automatic HBM Management;Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures;2022-07-11
3. Online Application Guidance for Heterogeneous Memory Systems;ACM Transactions on Architecture and Code Optimization;2022-07-06
4. Rapid Execution Time Estimation for Heterogeneous Memory Systems Through Differential Tracing;Lecture Notes in Computer Science;2022
5. Mamba: Portable Array-based Abstractions for Heterogeneous High-Performance Systems;2021 International Workshop on Performance, Portability and Productivity in HPC (P3HPC);2021-11