1. STAR-Transformer: a spatiotemporal cross attention transformer for human action recognition;D Ahn;Proceedings of IEEE Winter Conference on Applications of Computer Vision,2023
2. Self-supervised multimodal versatile networks;J B Alayrac;Proceedings of Advances in Neural Information Processing Systems,2020
3. Self-supervised learning by cross-modal audio-video clustering;H Alwassel;Proceedings of Advances in Neural Information Processing Systems,2020
4. ViViT: A video vision transformer;A Arnab;Proceedings of IEEE International Conference on Computer Vision (ICCV),2021
5. Multimodal machine learning: A survey and taxonomy;T Baltrušaitis;IEEE Transactions on Pattern Analysis and Machine Intelligence,2018