Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval

Published:2023-10-26
Container-title:Proceedings of the 31st ACM International Conference on Multimedia

Author:

Bin Yi¹, Li Haoxuan², Xu Yahui², Xu Xing², Yang Yang², Shen Heng Tao²

Affiliation:

1. National University of Singapore, Singapore, Singapore

2. University of Electronic Science and Technology of China, Chengdu, China

Funder

Sichuan Science and Technology Program

National Natural Science Foundation of China

Publisher

ACM

Link

https://dl.acm.org/doi/pdf/10.1145/3581783.3612427

References: 75 articles.

1. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In ICCV. 2425–2433.

2. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

3. Yi Bin, Xindi Shang, Bo Peng, Yujuan Ding, and Tat-Seng Chua. 2021. Multi-Perspective Video Captioning. In ACM Multimedia. 5110–5118.

4. Yi Bin, Yang Yang, Jie Zhou, Zi Huang, and Heng Tao Shen. 2017. Adaptively attending to visual attributes and linguistic knowledge for captioning. In ACM Multimedia. 1345–1353.

5. David M. Blei and Michael I. Jordan. 2003. Modeling annotated data. In SIGIR. 127–134.

Cited by 2 articles.

1. Text-to-Image Vehicle Re-Identification: Multi-Scale Multi-View Cross-Modal Alignment Network and a Unified Benchmark;IEEE Transactions on Intelligent Transportation Systems;2024-07

2. Cross-modal Consistency Learning with Fine-grained Fusion Network for Multimodal Fake News Detection;ACM Multimedia Asia 2023;2023-12-06