Affiliation:
1. School of Computer Science and Technology, North University of China, Taiyuan, China
2. School of Software and Microelectronics, Peking University, Beijing, China
3. School of Software and Microelectronics, Peking University, Beijing, China
Abstract
Image-text retrieval, a fundamental cross-modal task, performs similarity reasoning between images and texts. Its primary challenge is cross-modal semantic heterogeneity: the semantic features of the visual and textual modalities are rich but distinct. A scene graph is an effective representation for both images and texts, as it explicitly models objects and their relations. Existing scene-graph-based methods have not fully exploited the features of various granularities implicit in a scene graph (e.g., triplets); this inadequate feature matching loses non-trivial semantic information (e.g., the inner relations among triplets). We therefore propose a Semantic-Consistency Enhanced Multi-Level Scene Graph Matching (SEMScene) network, which exploits the semantic relevance between visual and textual scene graphs from fine-grained to coarse-grained levels. First, under the scene graph representation, we perform feature matching at three levels: low-level node matching, mid-level semantic triplet matching, and high-level holistic scene graph matching. Second, to enhance semantic consistency for object-fused triplets carrying key correlation information, we propose a dual-step constraint mechanism in the mid-level matching. Third, to guide the model to learn the semantic consistency of matched image-text pairs, we devise effective loss functions for each stage of the dual-step constraint. Comprehensive experiments on the Flickr30K and MS-COCO datasets demonstrate that SEMScene achieves state-of-the-art performance with significant improvements.
Funder
National Key R&D Program of China
Publisher
Association for Computing Machinery (ACM)
Cited by
1 article.
1. Depth Matters: Spatial Proximity-based Gaze Cone Generation for Gaze Following in Wild;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-08-26