1. J. Lu. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In 33rd Conference on Neural Information Processing Systems.
2. H. Akbari, L. Yuan, R. Qian, W.-H. Chuang, S.-F. Chang, Y. Cui, and B. Gong. 2021. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In 35th Conference on Neural Information Processing Systems.
3. A. Radford, J. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, and J. Clark. 2021. Learning transferable visual models from natural language supervision. In 38th International Conference on Machine Learning.
4. A deep learning architecture of RA-DLNet for visual sentiment analysis.
5. Sentiment analysis in medical settings: New opportunities and challenges.