Affiliation:
1. East China Normal University, School of Computer Science and Technology, Shanghai, China
Funders:
National Natural Science Foundation of China
Technology Development
Science and Technology Commission of Shanghai Municipality
East China Normal University