Expediting Distributed DNN Training With Device Topology-Aware Graph Deployment

Published:2023-04 Issue:4 Volume:34 Page:1281-1293
ISSN:1045-9219
Container-title:IEEE Transactions on Parallel and Distributed Systems
Short-container-title:IEEE Trans. Parallel Distrib. Syst.

Author:

Zhang Shiwei¹, Yi Xiaodong¹, Diao Lansong², Wu Chuan¹, Wang Siyu², Lin Wei²

Affiliation:

1. Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong

2. Alibaba Group, Hangzhou, Zhejiang, China

Funder

Alibaba Group

Hong Kong RGC

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Subject

Computational Theory and Mathematics,Hardware and Architecture,Signal Processing

Link

http://xplorestaging.ieee.org/ielx7/71/10043602/10040900.pdf?arnumber=10040900

Reference: 55 articles.

1. NCCL 2.0;Jeaugey;Proc GPU Technol Conf,2017

2. Horovod: Fast and easy distributed deep learning in TensorFlow;Sergeev,2018

3. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters;Jiang;Proc 14th USENIX Symp Operating Syst Des Implementation,2020

4. Scaling Distributed Machine Learning with the Parameter Server

5. Post: Device placement with cross-entropy minimization and proximal policy optimization;Gao;Proc Adv Neural Inf Process Syst,2018

Cited by 3 articles.

1. HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis;Proceedings of the Nineteenth European Conference on Computer Systems;2024-04-22

2. Reliable data transmission for a VANET-IoIT architecture: A DNN approach;Internet of Things;2024-04

3. Adaptive partitioning and efficient scheduling for distributed DNN training in heterogeneous IoT environment;Computer Communications;2024-02