Step-Wise Hierarchical Alignment Network for Image-Text Matching

Published: 2021-08
Container-title: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence

Author:

Ji Zhong¹, Chen Kexin¹, Wang Haoran¹

Affiliation:

1. School of Electrical and Information Engineering, Tianjin University, Tianjin, China

Abstract

Image-text matching plays a central role in bridging the semantic gap between vision and language. The key to precise visual-semantic alignment lies in capturing the fine-grained cross-modal correspondence between image and text. Most previous methods rely on single-step reasoning to discover visual-semantic interactions and therefore lack the ability to exploit multi-level information to locate hierarchical fine-grained relevance. In contrast, we propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process. Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and then global-to-global alignment at the context level. This progressive alignment strategy supplies our model with more complementary and sufficient semantic clues for understanding the hierarchical correlations between image and text. Experimental results on two benchmark datasets demonstrate the superiority of the proposed method.
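
The abstract names three alignment steps but gives no formulas, so the following is only a toy sketch of how such a progression could be scored, not the SHAN model itself. Everything in it is an assumption made for illustration: the function names, the use of cosine similarity, mean pooling as a stand-in for a learned global feature, max-over-regions matching, and the equal-weight sum of the three step scores.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean; an assumed stand-in for a learned global feature."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def local_to_local(regions, words):
    """Fragment level: match each word to its most similar image region."""
    return sum(max(cosine(w, r) for r in regions) for w in words) / len(words)


def global_to_local(regions, words):
    """Context level, first step: compare the global image vector with each word."""
    global_image = mean_vector(regions)
    return sum(cosine(global_image, w) for w in words) / len(words)


def global_to_global(regions, words):
    """Context level, second step: compare global image and global text vectors."""
    return cosine(mean_vector(regions), mean_vector(words))


def stepwise_score(regions, words):
    """Sum the three step scores (equal weighting is an assumption)."""
    return (local_to_local(regions, words)
            + global_to_local(regions, words)
            + global_to_global(regions, words))


# Hypothetical 2-D features: two image regions and two candidate captions.
regions = [[1.0, 0.0], [0.0, 1.0]]
matching_words = [[1.0, 0.0], [0.0, 1.0]]
mismatched_words = [[-1.0, 0.0], [0.0, -1.0]]

print(stepwise_score(regions, matching_words))
print(stepwise_score(regions, mismatched_words))
```

In this toy setting the caption whose word vectors coincide with the region vectors scores higher than the one pointing in the opposite direction at every step, which is the behaviour a hierarchical matching score is meant to have.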

Publisher

International Joint Conferences on Artificial Intelligence Organization

Cited by 59 articles.

1. GADNet: Improving image–text matching via graph-based aggregation and disentanglement; Pattern Recognition; 2025-01

2. Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval; Applied Intelligence; 2024-09-11

3. SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval; ACM Transactions on Multimedia Computing, Communications, and Applications; 2024-08-16

4. Encrypted Video Search with Single/Multiple Writers; ACM Transactions on Multimedia Computing, Communications, and Applications; 2024-08-16

5. Multi-task hierarchical convolutional network for visual-semantic cross-modal retrieval; Pattern Recognition; 2024-07