1. Sudarsanan Rajasekaran, Manya Ghobadi, and Aditya Akella. 2024. CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters. In Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI’24). Santa Clara, CA, 1403–1420.
2. Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan Ports, and Peter Richtarik. 2021. Scaling Distributed Machine Learning with In-Network Aggregation. In Proceedings of the 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI’21). 785–808.
3. Weiyan Wang, Cengguang Zhang, Liu Yang, Kai Chen, and Kun Tan. 2022. Addressing Network Bottlenecks with Divide-and-Shuffle Synchronization for Distributed DNN Training. In IEEE INFOCOM 2022 - IEEE Conference on Computer Communications, London, United Kingdom, May 2-5, 2022. IEEE, 320–329.
4. ChonLam Lao, Yanfang Le, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, and Michael Swift. 2021. ATP: In-network Aggregation for Multi-tenant Learning. In Proceedings of the 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI’21). 741–761.
5. A2TP: Aggregator-aware In-network Aggregation for Multi-tenant Learning