1. Bottom-up and top-down attention for image captioning and visual question answering;Anderson,2018
2. Andrew, G., Arora, R., Bilmes, J., & Livescu, K. (2013). Deep canonical correlation analysis. In Proceedings of the international conference on machine learning (pp. 1247–1255).
3. IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval;Chen,2020
4. Uniter: Universal image-text representation learning;Chen,2020
5. Distributed attention for grounded image captioning;Chen,2021