Flexible silicon photonic architecture for accelerating distributed deep learning
Published:2024-01-09
Issue:2
Volume:16
Page:A157
ISSN:1943-0620
Container-title:Journal of Optical Communications and Networking
language:en
Short-container-title:J. Opt. Commun. Netw.
Author:
Wu Zhenguo,
Dai Liang Yuan,
Wang Yuyang,
Wang Songli,
Bergman Keren
Abstract
The increasing size and complexity of deep learning (DL) models have led to the wide adoption of distributed training methods in datacenters (DCs) and high-performance computing (HPC) systems. However, communication among distributed computing units (CUs) has emerged as a major bottleneck in the training process. In this study, we propose Flex-SiPAC, a flexible silicon photonic accelerated compute cluster designed to accelerate multi-tenant distributed DL training workloads. Flex-SiPAC takes a co-design approach that combines a silicon photonic hardware platform with a tailored collective algorithm, optimized to leverage the unique physical properties of the architecture. The hardware platform integrates a novel wavelength-reconfigurable transceiver design and a micro-resonator-based wavelength-reconfigurable switch, enabling the system to achieve flexible bandwidth steering in the wavelength domain. The collective algorithm is designed to support reconfigurable topologies, enabling efficient all-reduce communications that are commonly used in DL training. The feasibility of the Flex-SiPAC architecture is demonstrated through two testbed experiments. First, an optical testbed experiment demonstrates the flexible routing of wavelengths by shuffling an array of input wavelengths using a custom-designed spatial-wavelength selective switch. Second, a four-GPU testbed running two DL workloads shows a 23% improvement in job completion time compared to a similarly sized leaf-spine topology. We further evaluate Flex-SiPAC using large-scale simulations, which show that Flex-SiPAC is able to reduce the communication time by 26% to 29% compared to state-of-the-art compute clusters under representative collective operations.
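For context on the collective operation the abstract highlights, the following sketch simulates a generic ring all-reduce (a reduce-scatter phase followed by an all-gather phase), the gradient-synchronization primitive that dominates data-parallel DL training. It is not the Flex-SiPAC topology-reconfigurable collective algorithm described in the paper; the function name ring_all_reduce, the in-process simulation (no network, no GPUs), and the four-worker setup mirroring the testbed scale are assumptions made purely for illustration.

# Generic ring all-reduce simulation (illustrative only, not the paper's algorithm).
import numpy as np

def ring_all_reduce(grads):
    """Sum equally sized gradient vectors across n simulated ring workers."""
    n = len(grads)
    # Each worker splits its gradient into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Reduce-scatter: at step s, worker r sends chunk (r - s) % n to its
    # ring neighbor r + 1, which accumulates it. After n - 1 steps, worker w
    # owns the fully reduced chunk (w + 1) % n.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            dst = (r + 1) % n
            chunks[dst][c] = chunks[dst][c] + chunks[r][c]

    # All-gather: circulate the fully reduced chunks around the ring so that
    # every worker ends up holding the complete reduced gradient.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            dst = (r + 1) % n
            chunks[dst][c] = chunks[r][c].copy()

    return [np.concatenate(ch) for ch in chunks]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    workers = 4                                  # mirrors the four-GPU testbed scale
    grads = [rng.standard_normal(12) for _ in range(workers)]
    reduced = ring_all_reduce(grads)
    expected = np.sum(grads, axis=0)
    # Every simulated worker should hold the same summed gradient.
    assert all(np.allclose(r, expected) for r in reduced)
    print("ring all-reduce OK")

Each phase takes n - 1 ring steps, so every worker moves roughly 2(n - 1)/n of the gradient size per all-reduce; reconfigurable collectives such as the one described in the abstract target exactly this communication cost by steering bandwidth in the wavelength domain to match the traffic pattern.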
Funder
Advanced Research Projects Agency-Energy
National Security Agency
Center for Ubiquitous Connectivity
Semiconductor Research Corporation
Defense Advanced Research Projects Agency
Publisher
Optica Publishing Group
Subject
Computer Networks and Communications