1. T. Afouras, J.S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
2. J.L. Alcázar, F. Caba, L. Mai, F. Perazzi, J.-Y. Lee, P. Arbelaez, and B. Ghanem. Active speakers in context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12465–12474, June 2020.
3. J.L. Alcázar, F. Caba, A.K. Thabet, and B. Ghanem. MAAS: Multi-modal assignation for active speaker detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 265–274, Oct. 2021.
4. A. Baevski, Y. Zhou, A. Mohamed, and M. Auli. Wav2Vec 2.0: A framework for self-supervised learning of speech representations. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12449–12460. Curran Associates, Inc., 2020.
5. C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, and S.S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008.