1. Middya AI, Nag B, Roy S (2022) Deep learning based multimodal emotion recognition using model-level fusion of audio-visual modalities. Knowl-Based Syst 244:108580
2. Ye M, You Q, Ma F (2022) Qualifier: Question-guided self-attentive multimodal fusion network for audio visual scene-aware dialog. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp 248–256
3. Akbari M, Karaman S (2019) Deep multimodal representation learning for robust scene understanding. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp 4137–4146
4. Zhao H, Xiong Y, Shao L (2018) Audio-visual scene recognition with multimodal attention fusion. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 1092–1100
5. Owens A, Efros AA (2018) Audio-visual scene analysis with self-supervised multisensory features. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 631–648