Affiliation:
1. School of Computer Science and Technology, North University of China, Taiyuan, China
2. School of Software and Microelectronics, Peking University, Beijing, China
3. School of Software and Microelectronics, Peking University, Beijing, China
Abstract
Image-text retrieval, a fundamental cross-modal task, performs similarity reasoning between images and texts. Its primary challenge is cross-modal semantic heterogeneity: the semantic features of the visual and textual modalities are rich but distinct. A scene graph is an effective representation for both images and texts, as it explicitly models objects and their relations. Existing scene-graph-based methods have not fully exploited the features of various granularities implicit in a scene graph (e.g., triplets); this inadequate feature matching loses non-trivial semantic information (e.g., the inner relations among triplets). We therefore propose a Semantic-Consistency Enhanced Multi-Level Scene Graph Matching (SEMScene) network, which exploits the semantic relevance between visual and textual scene graphs from fine-grained to coarse-grained levels. First, under the scene graph representation, we perform feature matching at three levels: low-level node matching, mid-level semantic triplet matching, and high-level holistic scene graph matching. Second, to enhance semantic consistency for object-fused triplets carrying key correlation information, we propose a dual-step constraint mechanism in the mid-level matching. Third, to guide the model to learn the semantic consistency of matched image-text pairs, we devise effective loss functions for each stage of the dual-step constraint. Comprehensive experiments on the Flickr30K and MS-COCO datasets demonstrate that SEMScene achieves state-of-the-art performance with significant improvements.
Funder
National Key R&D Program of China
Publisher
Association for Computing Machinery (ACM)
Cited by
1 article.
1. Depth Matters: Spatial Proximity-based Gaze Cone Generation for Gaze Following in Wild;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-08-26