A unified schedule policy of distributed machine learning framework for CPU-GPU cluster

Published:2021-06 Issue:3 Volume:39 Page:529-538
ISSN:1000-2758
Container-title:Xibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University
Short-container-title:西北工业大学学报

Author:

Zhu Ziyu,Tang Xiaochun,Zhao Quan

Abstract

With the widespread use of GPU hardware, more and more distributed machine learning applications use CPU-GPU hybrid cluster resources to improve algorithm efficiency. However, existing distributed machine learning scheduling frameworks consider task scheduling either only on CPU resources or only on GPU resources. Even when they account for the differences between CPU and GPU resources, they have difficulty improving the resource utilization of the system as a whole. In other words, the key challenge in using CPU-GPU clusters for distributed machine learning jobs is how to schedule the tasks within a job efficiently. This paper proposes a scheduling framework for CPU-GPU hybrid clusters. First, because CPUs and GPUs differ in computing power, the data is divided into fragments of different sizes suited to CPU and GPU computing resources. Second, the paper introduces a task scheduling method for the CPU-GPU hybrid cluster. Finally, the proposed method is verified using K-Means: the CPU-GPU hybrid computing framework improves K-Means performance by about 1.5 times, and performance improves significantly as the number of GPUs increases.
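The abstract does not say how fragment sizes are chosen. As a rough illustration only, the sketch below assumes each fragment is sized in proportion to a measured relative throughput per device; the function name `split_by_throughput`, the worker names, and the throughput figures are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple


def split_by_throughput(
    n_items: int, throughput: Dict[str, float]
) -> Dict[str, Tuple[int, int]]:
    """Give each worker a contiguous [start, end) slice of the data.

    Slice sizes are proportional to each worker's relative throughput.
    This proportional rule is an assumption for illustration, not the
    partitioning method described in the paper.
    """
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    if not throughput or any(rate <= 0 for rate in throughput.values()):
        raise ValueError("every worker needs a positive throughput")

    total = sum(throughput.values())
    names = list(throughput)
    slices: Dict[str, Tuple[int, int]] = {}
    start = 0
    cumulative = 0.0
    for index, name in enumerate(names):
        cumulative += throughput[name]
        if index == len(names) - 1:
            # The last worker takes whatever remains, so no item is lost
            # to floating-point rounding.
            end = n_items
        else:
            # Clamp so slices never overlap or run past the data.
            end = min(n_items, max(start, round(n_items * cumulative / total)))
        slices[name] = (start, end)
        start = end
    return slices


# Hypothetical example: one GPU assumed to be 4x faster than one CPU worker.
fragments = split_by_throughput(1000, {"cpu0": 1.0, "gpu0": 4.0})
```

With these assumed figures the GPU worker receives a fragment four times the size of the CPU worker's, so both should finish an iteration at roughly the same time. In practice the ratio would have to be measured for the specific algorithm and hardware.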

Publisher

EDP Sciences

Subject

General Engineering

Link

https://www.jnwpu.org/10.1051/jnwpu/20213930529/pdf

References (20 articles)

1. Chen T, Li M, Li Y, et al. MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems[J/OL]. (2015-12-03)[2015-12-07]. https://arxiv.org/abs/1512.01274

2. Jia Y, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture for fast feature embedding[C]//Proceedings of the 22nd ACM International Conference on Multimedia, 2014: 675–678

3. Petuum: A New Platform for Distributed Machine Learning on Big Data

4. Chen L, Huo X, Agrawal G. Accelerating MapReduce on a coupled CPU-GPU architecture[C]//Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012: 1–11

5. Ravi V T, Becchi M, Jiang W, et al. Scheduling concurrent applications on a cluster of CPU-GPU nodes[C]//2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2012: 140–147

Cited by 1 article.

1. A Method of Ideological and Political Teaching Resource Accuracy Scheduling and Control Based on MVC Framework;2022 Global Reliability and Prognostics and Health Management (PHM-Yantai);2022-10-13