1. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity;Fedus;arXiv preprint,2021
2. Efficient data-plane memory scheduling for in-network aggregation;Wang;arXiv preprint,2022
3. Exploring the limits of language modeling;Jozefowicz;arXiv preprint,2016
4. Ports, and Peter Richtarik. Scaling distributed machine learning with in-network aggregation;Sapio,2021
5. Atp: In-network aggregation for multi-tenant learning;Lao,2021