A combined priority scheduling method for distributed machine learning

Published: 2023-05-29 Issue: 1 Volume: 2023
ISSN:1687-1499
Container-title:EURASIP Journal on Wireless Communications and Networking
language:en
Short-container-title:J Wireless Com Network

Author:

Du TianTian, Xiao GongYi, Chen Jing, Zhang ChuanFu, Sun Hao, Li Wen, Geng YuDong

Abstract

Algorithms and frameworks for distributed machine learning are widely used in artificial intelligence engineering applications. Cloud platforms offer large pools of resources at lower cost, which makes them a convenient way to run such applications. With the rapid development of containerization, cloud-native combinations based on Docker and Kubernetes provide effective resource support for distributed machine learning. However, native Kubernetes does not provide efficient priority or fair resource scheduling strategies for the computationally intensive, time-consuming jobs of distributed machine learning, which easily leads to resource deadlock, resource waste, and low job execution efficiency. To exploit both the execution order among multiple jobs and the dependencies among the tasks of a single job, a combined priority scheduling method based on Kubernetes and Volcano is proposed for distributed machine learning, taking intra- and inter-group scheduling priorities into account. The combined model of inter- and intra-job priority considers user priority, task priority, longest wait time, task parallelism, and affinity and anti-affinity between the parameter server and worker nodes. It is mapped onto a scheduling strategy of inter- and intra-group pod priorities, enabling efficient scheduling and training of distributed machine learning. The experimental results show that the proposed method preferentially allocates resources to urgent, highly parallel, high-priority jobs from high-priority users and improves job execution efficiency. The affinity and anti-affinity settings among pods reduce, to a certain extent, the time spent on information exchange between the parameter server and worker nodes, thereby improving job completion efficiency. The group scheduling strategy alleviates the resource deadlock and waste caused by insufficient resources in cloud computing.
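The abstract names the factors that feed the inter-job priority (user priority, task priority, wait time, parallelism) and the group scheduling idea, but not the formula that combines them. The following is a minimal Python sketch of how such a scheme could look, not the paper's model: the weighted sum, the `WEIGHTS` values, the 0-10 priority scale, the normalization caps, and the names `Job`, `combined_priority`, and `schedule` are all assumptions made for illustration. Group scheduling is approximated as all-or-nothing admission, so a job receives resources for every one of its pods or for none.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    user_priority: int    # assumed scale 0-10, higher = more important user
    task_priority: int    # assumed scale 0-10, higher = more urgent job
    wait_seconds: float   # time the job has already spent in the queue
    parallelism: int      # pods in the group (parameter servers + workers)
    cpu_per_pod: int      # CPU cores requested by each pod


# Hypothetical weights and caps; the paper's actual values are not given here.
WEIGHTS = {"user": 0.4, "task": 0.3, "wait": 0.2, "parallelism": 0.1}
MAX_WAIT_SECONDS = 3600.0
MAX_PARALLELISM = 16


def combined_priority(job: Job) -> float:
    """Weighted sum of the normalized factors (illustrative formula only)."""
    wait = min(job.wait_seconds / MAX_WAIT_SECONDS, 1.0)
    parallelism = min(job.parallelism / MAX_PARALLELISM, 1.0)
    return (
        WEIGHTS["user"] * job.user_priority / 10
        + WEIGHTS["task"] * job.task_priority / 10
        + WEIGHTS["wait"] * wait
        + WEIGHTS["parallelism"] * parallelism
    )


def schedule(jobs, free_cpus):
    """Admit jobs in descending priority, all-or-nothing per job group."""
    admitted = []
    for job in sorted(jobs, key=combined_priority, reverse=True):
        demand = job.parallelism * job.cpu_per_pod
        # Partial admission is never allowed: a job that holds only some of
        # its pods would block resources without being able to make progress.
        if demand <= free_cpus:
            free_cpus -= demand
            admitted.append(job.name)
    return admitted, free_cpus
```

Skipping a job whose full demand does not fit, rather than starting a subset of its pods, is what avoids the deadlock described above, in which several half-started jobs each wait for resources held by the others. The wait-time term lets a long-queued job eventually outrank newer arrivals.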

Funder

Natural Science Foundation of Shandong Province

Key Technology Research and Development Program of Shandong

Qilu University of Technology

Publisher

Springer Science and Business Media LLC

Subject

Computer Networks and Communications,Computer Science Applications,Signal Processing

Link

https://link.springer.com/content/pdf/10.1186/s13638-023-02253-4.pdf
