Semantic Completion and Filtration for Image–Text Retrieval

Published:2023-02-27 Issue:4 Volume:19 Page:1-20
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Yang Song¹, Li Qiang², Li Wenhui², Li Xuan-Ya³, Jin Ran⁴, Lv Bo⁵, Wang Rui⁶, Liu Anan¹

Affiliation:

1. Tianjin University, China, and also with the Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China

2. Tianjin University, China

3. Baidu Inc., Beijing, China

4. Zhejiang Wanli University, Ningbo, China

5. The 30th Research Institute of China Electronics Technology Group Corporation, Chengdu, China

6. The 30th Research Institute of China Electronics Technology Group Corporation, China

Abstract

Image–text retrieval is a vital task in computer vision and has received growing attention because it connects data across modalities. Its critical challenges are learning unified representations and bridging the large gap between the visual and textual domains. Over the past few decades, many works have made significant progress in image–text retrieval, but they still face the challenge of incomplete text descriptions of images, i.e., how to fully learn the correlations between relevant region–word pairs with semantic diversity. In this article, we propose a novel semantic completion and filtration (SCAF) method to alleviate this issue. Specifically, a text semantic completion module generates a complete semantic description of an image from multi-view text descriptions, guiding the model to fully explore the correlations of relevant region–word pairs. Meanwhile, an adaptive structural semantic matching module filters out irrelevant region–word pairs by considering the relevance score of each pair, which helps the model focus on learning the relevance of matching pairs. Extensive experiments show that SCAF outperforms existing methods on the Flickr30K and MSCOCO datasets, which demonstrates the superiority of the proposed method.
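
The two ideas in the abstract, completing a description from several captions and filtering region–word pairs by relevance score, can be illustrated with a toy sketch. This is not the SCAF implementation: the function names, the word-union "completion", the cosine relevance score, the fixed threshold of 0.5, and the mean-of-max aggregation are all assumptions chosen for illustration, and the feature vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_captions(captions):
    """Toy 'completion': union of the words of several captions, in first-seen order."""
    seen, merged = set(), []
    for caption in captions:
        for word in caption.lower().split():
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


def filter_pairs(regions, words, threshold=0.5):
    """Keep only region-word pairs whose relevance score reaches the threshold.

    Returns a dict mapping (region_index, word_index) to the cosine score.
    The threshold is a hypothetical fixed value, not an adaptive one.
    """
    kept = {}
    for i, region in enumerate(regions):
        for j, word in enumerate(words):
            score = cosine(region, word)
            if score >= threshold:
                kept[(i, j)] = score
    return kept


def match_score(regions, words, threshold=0.5):
    """Mean over words of the best retained region score (0.0 for unmatched words)."""
    if not words:
        return 0.0
    kept = filter_pairs(regions, words, threshold)
    best = [
        max((s for (_, j), s in kept.items() if j == wj), default=0.0)
        for wj in range(len(words))
    ]
    return sum(best) / len(words)


# Made-up 2-D features: two image regions and two caption words.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [1.0, 1.0]]

merged = merge_captions(["A dog runs", "a brown dog"])
kept = filter_pairs(regions, words)
score = match_score(regions, words)
```

In this example the pair (region 1, word 0) has a cosine score of 0 and is discarded, so only the three remaining pairs contribute to the image–text score.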

Funder

National Natural Science Foundation of China

China Postdoctoral Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3572844

References: 54 articles.

1. VQA: Visual Question Answering

2. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment;Barzilay Regina;CoRR,2003

3. Ali Furkan Biten, Lluís Gómez, Marçal Rusiñol, and Dimosthenis Karatzas. 2019. Good news, everyone! Context driven entity-aware captioning for news images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 12466–12475.

4. Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, and Jungong Han. 2020. IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 12652–12660.

5. Empirical evaluation of gated recurrent neural networks on sequence modeling;Chung Junyoung;CoRR,2014

Cited by 10 articles.

1. Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-09-12

2. A method for image–text matching based on semantic filtering and adaptive adjustment;EURASIP Journal on Image and Video Processing;2024-08-29

3. Multi-view and region reasoning semantic enhancement for image-text retrieval;Multimedia Systems;2024-06-15

4. Object search by a concept-conditioned object detector;Neural Computing and Applications;2024-05-20

5. Universal Relocalizer for Weakly Supervised Referring Expression Grounding;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-05-16