1. Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
2. VQA: Visual Question Answering
3. Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, 8th Workshop on Syntax, Semantics, and Structure in Statistical Translation. ACL, Doha, Qatar, 103–111.
4. Long Hoang Dang, Thao Minh Le, Vuong Le, and Truyen Tran. 2021. Hierarchical object-oriented spatio-temporal reasoning for video question answering. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, IJCAI-21. IJCAI, Montreal, Canada, 636–642.
5. Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering