Affiliation:
1. Virginia Tech
2. University of Texas at Austin and VMware Research
Abstract
In this work, we present libHetMP, an OpenMP runtime for automatically and transparently distributing parallel computation across heterogeneous nodes. libHetMP targets platforms comprising CPUs with different instruction set architectures (ISAs) coupled by a high-speed memory interconnect, where cross-ISA binary incompatibility and non-coherent caches require that application data be marshaled to be shared across CPUs. Because of this, work distribution decisions must take into account both the relative compute performance of asymmetric CPUs and communication overheads. libHetMP drives workload distribution decisions without programmer intervention by measuring performance characteristics during cross-node execution. A novel HetProbe loop iteration scheduler decides whether cross-node execution is beneficial. When it is, HetProbe distributes work according to the relative performance of the CPUs; when it is not, it places all work on the set of homogeneous CPUs providing the best performance. We evaluate libHetMP using compute kernels from several OpenMP benchmark suites and show a geometric mean 41% speedup in execution time across asymmetric CPUs. Because some workloads may exhibit irregular behavior among iterations, we extend libHetMP with a second scheduler called HetProbe-I. The evaluation of HetProbe-I shows that it can further improve speedup for irregular computation, in some cases by up to 24%, by triggering periodic distribution decisions.
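The abstract does not give HetProbe's actual decision rule, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a short probing phase has already produced two hypothetical measurements per node: the time per iteration when that node runs the loop alone (`local_secs_per_iter`) and the time per iteration during cross-node execution, including communication overhead (`shared_secs_per_iter`). All names here are illustrative and are not part of libHetMP.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical per-node measurements from a short probing phase."""
    name: str
    local_secs_per_iter: float   # cost per iteration when this node runs alone
    shared_secs_per_iter: float  # cost per iteration cross-node, incl. communication


def distribute(total_iters, probes):
    """Return {node name: iteration count} for one parallel loop.

    Assumed rule (not taken from the paper): run cross-node only if the
    estimated cross-node time beats the best single-node time.
    """
    # Best single-node option: all iterations on the fastest node running alone.
    best = min(probes, key=lambda p: p.local_secs_per_iter)
    single_time = total_iters * best.local_secs_per_iter

    # Cross-node option: nodes work concurrently, so their rates add up.
    rates = {p.name: 1.0 / p.shared_secs_per_iter for p in probes}
    cross_time = total_iters / sum(rates.values())

    if cross_time >= single_time:
        # Cross-node execution is not beneficial: keep everything on one node.
        return {p.name: (total_iters if p is best else 0) for p in probes}

    # Split iterations in proportion to each node's measured rate.
    total_rate = sum(rates.values())
    counts = {n: int(round(total_iters * r / total_rate)) for n, r in rates.items()}

    # Rounding may leave the sum off by a few iterations; fix it on the fastest node.
    fastest = max(rates, key=rates.get)
    counts[fastest] += total_iters - sum(counts.values())
    return counts
```

For example, with node `A` at 1.0 s per iteration and node `B` at 4.0 s per iteration in both modes, 100 iterations are split 80/20. If communication overhead doubles `A`'s cross-node cost and pushes `B`'s to 8.0 s, the sketch keeps all 100 iterations on `A`. A periodic variant in the spirit of HetProbe-I would presumably re-run this decision at intervals, but the abstract does not describe how.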
Funder
US Office of Naval Research
NAVSEA/NEEC
Publisher
Association for Computing Machinery (ACM)
Cited by
2 articles.