1. Bahdanau, D., Cho, K., Bengio, Y., 2015. Neural machine translation by jointly learning to align and translate. In: Bengio, Y., LeCun, Y. (Eds.), Proceedings of the International Conference on Learning Representations.
2. Bai, Y., Wang, J., Long, Y., et al., 2021. Discriminative latent semantic graph for video captioning. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 3556–3564.
3. Banerjee, S., Lavie, A., 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics Workshop, pp. 65–72.
4. Brody, S., Alon, U., Yahav, E., 2022. How attentive are graph attention networks? In: Proceedings of the International Conference on Learning Representations.
5. Cao, S., Wang, B., Zhang, W., Ma, L., 2022. Visual consensus modeling for video-text retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 167–175.