1. Vision-and-language navigation: A survey of tasks, methods, and future directions; Gu, 2023
2. A comprehensive survey on cross-modal retrieval; Wang, 2016
3. Towards local visual modeling for image captioning; Ma; Pattern Recognit., 2023
4. CAAN: Context-aware attention network for visual question answering; Chen; Pattern Recognit., 2022
5. P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, A. Van Den Hengel, Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3674–3683.