Exploring the benefits of using co-packaged optics in data center and AI supercomputer networks: a simulation-based analysis [Invited]

Published: 2024-01-08 Issue: 2 Volume: 16 Page: A143
ISSN: 1943-0620
Container-title: Journal of Optical Communications and Networking
language: en
Short-container-title: J. Opt. Commun. Netw.

Author:

Maniotis Pavlos, Kuchta Daniel M.

Abstract

We investigate the advantages of using co-packaged optics in next-generation data center and AI supercomputer networks. The increased escape bandwidth offered by co-packaged optics provides multiple possibilities for building 50T switches and beyond, expanding the opportunities in both the data center and supercomputing domains. This provides network architects with the opportunity to expand their design space and develop simplified networks with enhanced network locality properties. Co-packaging at the switch and server points enables networks with double capacity while reducing the switch count by 64% compared to state-of-the-art systems. We evaluate these concepts through discrete-event simulations using all-to-all and all-reduce traffic patterns that simulate collective communications commonly found in network-bound applications. Initially, we investigate the all-to-all overhead involved in distributing the virtual machines of the applications across multiple leaf switches and compare it to the scenario in which all VMs are placed under a single switch. Subsequently, we evaluate the performance of an AI supercomputing cluster by simulating both patterns for different message sizes, while also varying the number of participating nodes. The results suggest that networks with improved locality properties become increasingly important as the network stack operates at higher speeds; for a stack latency of 1.25 µs, placing the applications under multiple switches can result in up to 68% higher completion times than placing them under a single switch. For AI supercomputers, significant improvements are observed in the mean server throughput, reaching more than 90% for configurations involving 256 nodes and message sizes of at least 128 KiB.

Funder

Advanced Research Projects Agency - Energy

U.S. Department of Energy

Publisher

Optica Publishing Group

Subject

Computer Networks and Communications

Cited by 2 articles.

1. An 8 × 160 Gb s⁻¹ all-silicon avalanche photodiode chip;Nature Photonics;2024-08-09

2. Characterization of QSFP and OSFP CPO ELS modules employing an 8-channel CWDM TOSA in practical air-cooling conditions;2024 IEEE 74th Electronic Components and Technology Conference (ECTC);2024-05-28