1. Look, imagine and match: improving textual-visual cross-modal retrieval with generative models;Gu,2018
2. Shuffle-then-assemble: learning object-agnostic visual relationship features;Yang,2018
3. Show, control and tell: a framework for generating controllable and grounded captions;Cornia,2019
4. Entangled transformer for image captioning;Li,2019
5. Learning to collocate neural modules for image captioning;Yang,2019