Authors:
Jiang Yimin, Yan Tingfei, Yao Mingze, Wang Huibing, Liu Wenzhe
Funder:
Dalian Science and Technology Bureau