Context‐aware relation enhancement and similarity reasoning for image‐text retrieval

Published: 2024-01-30 Issue: 5 Volume: 18 Page: 652-665
ISSN: 1751-9632
Container-title: IET Computer Vision
Language: en

Author:

Cui Zheng¹, Hu Yongli¹, Sun Yanfeng¹, Yin Baocai¹

Affiliation:

1. Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing University of Technology, Beijing, China

Abstract

Image‐text retrieval is a fundamental yet challenging task, which aims to bridge the semantic gap between heterogeneous data to achieve precise measurements of semantic similarity. Fine‐grained alignment between cross‐modal features plays a key role in many successful methods. Nevertheless, existing methods cannot effectively utilise intra‐modal information to enhance feature representation, and they lack powerful similarity reasoning to obtain a precise similarity score. To tackle these issues, a context‐aware Relation Enhancement and Similarity Reasoning model, called RESR, is proposed, which conducts both intra‐modal relation enhancement and inter‐modal similarity reasoning while considering global‐context information. For intra‐modal relation enhancement, a novel context‐aware graph convolutional network is introduced to enhance local feature representations by utilising relation and global‐context information. For inter‐modal similarity reasoning, local and global similarity features are exploited by the bidirectional alignment of image and text, and similarity reasoning is implemented among multi‐granularity similarity features. Finally, the refined local and global similarity features are adaptively fused to obtain a precise similarity score. The experimental results show that the proposed model outperforms some state‐of‐the‐art approaches, achieving average improvements of 2.5% and 6.3% in R@sum on the Flickr30K and MS‐COCO datasets, respectively.
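
The abstract reports results in R@sum without defining it. In the image‐text retrieval literature, R@sum conventionally denotes the sum of Recall@1, Recall@5 and Recall@10 over both retrieval directions (image‐to‐text and text‐to‐image); whether the paper uses exactly this definition is an assumption here. The sketch below illustrates that convention. The function names `recall_at_k` and `r_at_sum` are hypothetical, and the one‐to‐one pairing of images and captions is a simplification (Flickr30K and MS‐COCO pair each image with five captions).

```python
import numpy as np


def recall_at_k(sim, k):
    """Percentage of queries (rows) whose ground-truth item ranks in the top k.

    Assumes a square similarity matrix in which candidate i is the single
    correct match for query i (a simplifying assumption for illustration).
    """
    # Rank of the ground-truth item = number of candidates scored strictly higher.
    ground_truth = np.diag(sim)[:, None]
    ranks = (sim > ground_truth).sum(axis=1)
    return 100.0 * np.mean(ranks < k)


def r_at_sum(sim, ks=(1, 5, 10)):
    """Sum of Recall@K over both directions (rows as queries, then columns)."""
    image_to_text = sum(recall_at_k(sim, k) for k in ks)
    text_to_image = sum(recall_at_k(sim.T, k) for k in ks)
    return image_to_text + text_to_image


# Toy example: rows are images, columns are captions.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.4],
])
print(recall_at_k(sim, 1))
print(r_at_sum(sim))
```

Under this convention a perfect model scores 600, since each of the six Recall@K terms is at most 100.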

Publisher

Institution of Engineering and Technology (IET)

Reference: 57 articles.

1. Deep visual-semantic alignments for generating image descriptions

2. Deep fragment embeddings for bidirectional image sentence mapping; Karpathy A.; Adv. Neural Inf. Process. Syst., 2014

3. Hetero-Manifold Regularisation for Cross-Modal Hashing

4. VQA: Visual Question Answering

5. Kazemi V. and Ali E.: Show ask attend and answer: a strong baseline for visual question answering. arXiv preprint arXiv:1704.03162 (2017)