1. Transformer models for text-based emotion detection: a review of BERT-based approaches;Acheampong;Artificial Intelligence Review,2021
2. SiT: Self-supervised vision transformer;Ahmed,2021
3. Akbari, H., Yuan, L., Qian, R., Chuang, W., Chang, S., Cui, Y., et al. (2021). VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in neural information processing systems 34: annual conference on neural information processing systems, NeurIPS, December 6-14, 2021, virtual (pp. 24206–24221).
4. VQA: visual question answering;Antol,2015
5. XLS-R: self-supervised cross-lingual speech representation learning at scale;Babu,2022