Affiliation:
1. University of California, Riverside
Abstract
Today's heterogeneous architectures bring together multiple general-purpose CPUs and multiple domain-specific GPUs and FPGAs to provide dramatic speedup for many applications. However, the challenge lies in utilizing these heterogeneous processors to optimize overall application performance by minimizing workload completion time. Operating system and application development for these systems is in its infancy.
In this article, we propose a new scheduling and workload balancing scheme, HDSS, for execution of loops having dependent or independent iterations on heterogeneous multiprocessor systems. The new algorithm dynamically learns the computational power of each processor during an adaptive phase and then schedules the remainder of the workload using a weighted self-scheduling scheme during the completion phase. Unlike previous studies, our scheme considers the runtime effects of block sizes on the performance of heterogeneous multiprocessors. It finds the right trade-off between large and small block sizes to maintain a balanced workload while keeping accelerator utilization at its maximum. Our algorithm does not require offline training or architecture-specific parameters.
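The two-phase idea described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the HDSS algorithm itself: the function name `simulate_two_phase`, the fixed probe block size, the per-processor `speeds` (iterations per second), and the rule of handing each idle processor half of its weighted share of the remaining iterations are all hypothetical choices made for illustration. The abstract does not give the actual block-size formulas.

```python
import heapq


def simulate_two_phase(total_iters, speeds, probe_block=8, min_block=1):
    """Simulate a two-phase loop schedule (illustrative only).

    speeds: hypothetical true rates (iterations/second), unknown to the scheduler.
    Returns (iterations assigned per processor, finish time per processor).
    """
    n = len(speeds)
    remaining = total_iters
    assigned = [0] * n
    finish = [0.0] * n
    measured = [0.0] * n

    # Adaptive phase: each processor runs one small probe block and
    # the scheduler records the observed rate.
    for p in range(n):
        block = min(probe_block, remaining)
        if block == 0:
            break
        elapsed = block / speeds[p]      # simulated execution time
        measured[p] = block / elapsed    # observed iterations/second
        finish[p] = elapsed
        assigned[p] += block
        remaining -= block

    if remaining == 0:
        return assigned, finish

    # Completion phase: weighted self-scheduling. Whichever processor
    # becomes idle first requests a block sized by its learned weight.
    total_rate = sum(measured)
    weights = [m / total_rate for m in measured]
    ready = [(finish[p], p) for p in range(n)]
    heapq.heapify(ready)

    while remaining > 0:
        t, p = heapq.heappop(ready)
        # Assumption: take half of the weighted share of what is left, so
        # blocks start large and shrink as the loop nears completion.
        block = max(min_block, int(remaining * weights[p] / 2))
        block = min(block, remaining)
        remaining -= block
        assigned[p] += block
        finish[p] = t + block / speeds[p]
        heapq.heappush(ready, (finish[p], p))

    return assigned, finish


if __name__ == "__main__":
    # One slow CPU core, one faster core, one accelerator (made-up rates).
    assigned, finish = simulate_two_phase(10_000, [1.0, 4.0, 16.0])
    print("iterations per processor:", assigned)
    print("finish times:", [round(t, 2) for t in finish])
```

In this toy model the faster processors receive proportionally more iterations and all processors finish at nearly the same time. Shrinking blocks trade the low overhead of large early blocks against the finer balance of small late ones, which is one simple way to picture the block-size trade-off the abstract mentions.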
We have evaluated our scheme on two different heterogeneous architectures: an AMD 64-core Bulldozer system with an NVIDIA Fermi C2050 GPU and an Intel Xeon 32-core SGI Altix 4700 supercomputer with Xilinx Virtex-4 FPGAs. The experimental results show that our new scheduling algorithm can achieve performance improvements of over 200% in the best case when compared to the closest existing load balancing scheme. Our algorithm also achieves full processor utilization, with all processors completing at nearly the same time, which is significantly better than current alternative approaches.
Funder
Division of Computing and Communication Foundations
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
Cited by
71 articles.