Abstract
Expanding the scale of GPU-based deep learning (DL) clusters brings not only accelerated AI services but also significant energy consumption costs. In this paper, we propose a cost-efficient deep learning job allocation (CE-DLA) approach that minimizes the energy consumption cost of DL cluster operation while guaranteeing the performance requirements of user requests. To do this, we first categorize DL jobs into two classes, training jobs and inference jobs; through architecture-agnostic modeling, the CE-DLA approach can perform a fine-grained mapping of heterogeneous DL jobs to GPU computing nodes. Second, we design an electricity price-aware DL job allocation that minimizes the energy consumption cost of the cluster, and we show that our approach efficiently avoids the peak-rate time slots of the GPU computing nodes by means of a mixed-integer nonlinear programming (MINLP) formulation. We additionally integrate the dynamic right-sizing (DRS) method into the CE-DLA approach to minimize the energy consumption of idle nodes that have no running jobs. To investigate the realistic behavior of our approach, we measure actual power output from NVIDIA GPU devices running well-known deep neural network (DNN) models. Given real trace data of electricity prices, we show that the CE-DLA approach outperforms its competitors in terms of both energy consumption cost and DL job processing performance.
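The paper's full MINLP formulation is not reproduced on this page. As a minimal illustrative sketch of the price-aware allocation idea described above (not the authors' actual model), one can write an objective that weights each node's load-dependent power draw by the time-varying electricity price, subject to workload and capacity constraints; all symbols below are assumptions introduced for illustration only.

\begin{align}
\min_{x_{j,n,t}\in\{0,1\}} \quad & \sum_{t\in\mathcal{T}} \pi_t \sum_{n\in\mathcal{N}} P_n\!\Big(\sum_{j\in\mathcal{J}} x_{j,n,t}\Big)\,\Delta t \\
\text{s.t.} \quad & \sum_{n\in\mathcal{N}} \sum_{t \le d_j} r_{j,n}\, x_{j,n,t} \;\ge\; W_j, && \forall j\in\mathcal{J} \quad \text{(workload met by deadline)} \\
& \sum_{j\in\mathcal{J}} x_{j,n,t} \;\le\; C_n, && \forall n\in\mathcal{N},\; t\in\mathcal{T} \quad \text{(GPU capacity)}
\end{align}

Here $x_{j,n,t}$ indicates whether job $j$ runs on node $n$ in time slot $t$, $\pi_t$ is the electricity price in slot $t$, $P_n(\cdot)$ is the (generally nonlinear) power draw of node $n$ as a function of its load, $r_{j,n}$ is the processing rate of job $j$ on node $n$, $W_j$ and $d_j$ are the job's workload and deadline, and $C_n$ is the node's GPU capacity. Because $P_n(\cdot)$ is nonlinear in the integer variables, the problem is an MINLP; the price weighting $\pi_t$ is what steers allocations away from peak-rate slots, and a dynamic right-sizing term could additionally penalize keeping idle nodes powered on.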
Funder
National Research Foundation of Korea
Subject
Energy (miscellaneous), Energy Engineering and Power Technology, Renewable Energy, Sustainability and the Environment, Electrical and Electronic Engineering, Control and Optimization, Engineering (miscellaneous), Building and Construction
Cited by
6 articles.