1. Multimodal machine learning: a survey and taxonomy;Baltrušaitis;IEEE Trans. Pattern Anal. Mach. Intell.,2019
2. Show, attend and tell: Neural image caption generation with visual attention;Xu,2015
3. VQA: visual question answering;Antol,2015
4. Learning deep representations of fine-grained visual descriptions;Reed,2016
5. Stacked attention networks for image question answering;Yang,2016