1. Bottom-up and top-down attention for image captioning and visual question answering;Anderson,2018
2. VQA: Visual question answering;Antol,2015
3. Neural machine translation by jointly learning to align and translate;Bahdanau,2015
4. G3raphground: Graph-based language grounding;Bajaj,2019
5. MUTAN: Multimodal Tucker fusion for visual question answering;Ben-Younes,2017