1. Amodei, D., et al.: Deep speech 2: end-to-end speech recognition in English and Mandarin. In: Proceedings of the 33rd International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 48, pp. 173–182 (2016). https://proceedings.mlr.press/v48/amodei16.html
2. Baevski, A., Hsu, W.N., Conneau, A., Auli, M.: Unsupervised speech recognition. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 34, pp. 27826–27839 (2021). https://proceedings.neurips.cc/paper_files/paper/2021/file/ea159dc9788ffac311592613b7f71fbb-Paper.pdf
3. Baevski, A., Mohamed, A.: Effectiveness of self-supervised pre-training for ASR. In: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7694–7698 (2020). https://doi.org/10.1109/ICASSP40776.2020.9054224
4. Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-supervised learning of speech representations. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 12449–12460 (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/92d1e1eb1cd6f9fba3227870bb6d7f07-Paper.pdf
5. Bai, Y., Yi, J., Tao, J., Tian, Z., Wen, Z., Zhang, S.: Fast end-to-end speech recognition via non-autoregressive models and cross-modal knowledge transferring from BERT. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 1897–1911 (2021). https://doi.org/10.1109/TASLP.2021.3082299