1. Akbari, H., et al.: VATT: transformers for multimodal self-supervised learning from raw video, audio and text. arXiv preprint arXiv:2104.11178 (2021)
2. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
3. Bossard, L.: Lecture Notes in Computer Science (2014)
4. Cai, Z.: Lecture Notes in Computer Science (2016)
5. Cao, J.: Lecture Notes in Computer Science (2020)