1. Learning two-branch neural networks for image-text matching tasks;Wang;IEEE Trans. Pattern Anal. Mach.Intell. (TPAMI),2019
2. Bottom-up and top-down attention for image captioning and visual question answering;Anderson,2018
3. Hierarchical multimodal LSTM for dense visual semantic embedding;Niu,2017
4. VSE++: improved visual-semantic embeddings;Faghri,2018
5. Deep correlation for matching images and text;Yan,2015