1. Bottom-up and top-down attention for image captioning and visual question answering;Anderson,2018
2. Ba, J.L., Kiros, J.R., Hinton, G.E., 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
3. Murel: Multimodal relational reasoning for visual question answering;Cadene,2019
4. Visual question reasoning on general dependency tree;Cao,2018
5. Ref-nms: Breaking proposal bottlenecks in two-stage referring expression grounding;Chen,2021