1. Rasiwasia N, Costa Pereira J, Coviello E, et al. A new approach to cross-modal multimedia retrieval[C]. Proceedings of the 18th ACM international conference on Multimedia. 2010: 251-260.
2. Jia Y, Salzmann M, Darrell T. Learning cross-modality similarity for multinomial data[C]. 2011 international conference on computer vision. IEEE, 2011: 2407-2414.
3. Faghri F, Fleet D, Kiros J, et al. VSE++: Improved visual semantic embeddings[C]. British Machine Vision Conference. 2018: 935-943.
4. Gu J, Cai J, Joty S R, et al. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models[C]. Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7181-7189.
5. Zheng Z, Zheng L, Garrett M, et al. Dual-path convolutional image-text embeddings with instance loss[J]. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2020, 16(2): 1-23.