1. Anderson, 2018. Bottom-up and top-down attention for image captioning and visual question answering.
2. Bahdanau, 2015. Neural machine translation by jointly learning to align and translate.
3. Cai, Guanyu, Zhang, Jun, Jiang, Xinyang, Gong, Yifei, He, Lianghua, Yu, Fufu, Peng, Pai, Guo, Xiaowei, Huang, Feiyue, Sun, Xing, 2021. Ask&Confirm: Active Detail Enriching for Cross-Modal Retrieval With Partial Query. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1835–1844.
4. Chen, 2020. Adaptive offline quintuplet loss for image-text matching.
5. Chen, 2020. IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval.