1. S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, S. Vijayanarasimhan, Youtube-8m: A large-scale video classification benchmark. (2016). arXiv preprint arXiv:1609.08675
2. T. Afouras, J.S. Chung, A. Zisserman, Lrs3-ted: A large-scale dataset for visual speech recognition. (2018). arXiv preprint arXiv:1809.00496
3. S. Allen. How many videos are on YouTube? 33+ interesting stats. (2023). https://www.nichepursuits.com/how-many-videos-are-on-youtube/. Accessed 17 Dec 2023
4. R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F.M. Tyers, G. Weber, Common voice: A massively-multilingual speech corpus. Proceedings of the Twelfth Language Resources and Evaluation Conference. (European Language Resources Association, Marseille, 2020), p. 4218–4222
5. A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020)