1. Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity," arXiv preprint, 2021.
2. Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," arXiv preprint, 2017.
3. Riquelme et al., "Scaling Vision with Sparse Mixture of Experts," arXiv preprint, 2021.
4. "Insights on Neural Representations for End-to-End Speech Recognition."
5. Raghu et al., "SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability," arXiv preprint, 2017.