1. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems. Curran Associates, Inc. (2017)
2. Gulati, A., et al.: Conformer: convolution-augmented transformer for speech recognition. arXiv, 16 May 2020. https://doi.org/10.48550/arXiv.2005.08100
3. Hsu, W.-N., Bolte, B., Tsai, Y.-H.H., et al.: HuBERT: self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 3451–3460 (2021)
4. Baevski, A., Zhou, H., Mohamed, A., et al.: Wav2vec 2.0: a framework for self-supervised learning of speech representations. In: Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS 2020), pp. 12449–12460. Curran Associates Inc., Red Hook, NY, USA (2020)
5. Zhang, Y., et al.: Google USM: scaling automatic speech recognition beyond 100 languages. arXiv, 24 September 2023. https://doi.org/10.48550/arXiv.2303.01037