Exploring the benefits of using co-packaged optics in data center and AI supercomputer networks: a simulation-based analysis [Invited]

Author:

Maniotis PavlosORCID,Kuchta Daniel M.

Abstract

We investigate the advantages of using co-packaged optics in next-generation data center and AI supercomputer networks. The increased escape bandwidth offered by co-packaged optics provides multiple possibilities for building 50T switches and beyond, expanding the opportunities in both the data center and supercomputing domains. This provides network architects with the opportunity to expand their design space and develop simplified networks with enhanced network locality properties. Co-packaging at the switch and server points enables networks with double capacity while reducing the switch count by 64% compared to state-of-the-art systems. We evaluate these concepts through discrete-event simulations using all-to-all and all-reduce traffic patterns that simulate collective communications commonly found in network-bound applications. Initially, we investigate the all-to-all overhead involved in distributing the virtual machines of the applications across multiple leaf switches and compare it to the scenario in which all VMs are placed under a single switch. Subsequently, we evaluate the performance of an AI supercomputing cluster by simulating both patterns for different message sizes, while also varying the number of participating nodes. The results suggest that networks with improved locality properties become increasingly important as the network stack operates at higher speeds; for a stack latency of 1.25 µs, placing the applications under multiple switches can result in up to 68% higher completion times than placing them under a single switch. For AI supercomputers, significant improvements are observed in the mean server throughput, reaching more than 90% for configurations involving 256 nodes and message sizes of at least 128 KiB.

Funder

Advanced Research Projects Agency - Energy

U.S. Department of Energy

Publisher

Optica Publishing Group

Subject

Computer Networks and Communications

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3