Distributed deep learning training using silicon photonic switched architectures

Published:2022-03-01 Issue:3 Volume:7 Page:030901
ISSN:2378-0967
Container-title:APL Photonics
language:en
Short-container-title:APL Photonics

Author:

Zhu Ziyi¹, Teh Min Yee¹, Wu Zhenguo¹, Glick Madeleine Strom¹, Yan Shijia¹, Hattink Maarten¹, Bergman Keren¹

Affiliation:

1. Department of Electrical Engineering, Columbia University, New York, New York 10027, USA

Abstract

The scaling trends of deep learning models and distributed training workloads are challenging network capacities in today’s datacenters and high-performance computing (HPC) systems. We propose a system architecture that leverages silicon photonic (SiP) switch-enabled server regrouping using bandwidth steering to tackle the challenges and accelerate distributed deep learning training. In addition, our proposed system architecture utilizes a highly integrated operating system-based SiP switch control scheme to reduce implementation complexity. To demonstrate the feasibility of our proposal, we built an experimental testbed with a SiP switch-enabled reconfigurable fat tree topology and evaluated the network performance of distributed ring all-reduce and parameter server workloads. The experimental results show up to 3.6× improvements over the static non-reconfigurable fat tree. Our large-scale simulation results show that server regrouping can deliver up to 2.3× flow throughput improvement for a 2× tapered fat tree and a further 11% improvement when higher-layer bandwidth steering is employed. The collective results show the potential of integrating SiP switches into datacenters and HPC systems to accelerate distributed deep learning training.

Publisher

AIP Publishing

Subject

Computer Networks and Communications; Atomic and Molecular Physics, and Optics

Link

https://aip.scitation.org/doi/pdf/10.1063/5.0070711

Reference: 55 articles.

1. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556 (2014).

2. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805 (2018).

3. Deep Neural Networks for YouTube Recommendations

4. J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. Patwary, M. Ali, Y. Yang, and Y. Zhou, “Deep learning scaling is predictable, empirically,” arXiv:1712.00409 (2017).

Cited by 6 articles.

1. Efficient all-to-all Collective Communication Schedules for Direct-connect Topologies;Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing;2024-06-03

2. Fast and scalable all-optical network architecture for distributed deep learning;Journal of Optical Communications and Networking;2024-02-22

3. On the Performance Investigation of a Recursive Fast Optical Switch-Based High Performance Computing Network Architecture;IEEE/ACM Transactions on Networking;2023

4. Efficient neural network accelerators with optical computing and communication;Computer Science and Information Systems;2023

5. Photonic switch fabrics in data center/high-performance computing networks;Integrated Photonics for Data Communication Applications;2023