1. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation;Wang,2019
2. Towards AI-complete question answering: A set of prerequisite toy tasks;Weston,2016
3. VQA: Visual question answering;Antol,2015
4. Stacked attention networks for image question answering;Yang,2016
5. Bottom-up and top-down attention for image captioning and visual question answering;Anderson,2018