Multi-Level Cross-Modal Semantic Alignment Network for Video–Text Retrieval

Published:2022-09-15 Issue:18 Volume:10 Page:3346
ISSN:2227-7390
Container-title:Mathematics
language:en
Short-container-title:Mathematics

Author:

Nian Fudong, Ding Ling, Hu Yuxia, Gu Yanhong

Abstract

This paper aims to improve the performance of video–text retrieval. To date, many algorithms have been proposed that extend the video–text similarity measure from a single global semantic level to multiple semantic levels. However, these methods may suffer from the following limitations: (1) they largely ignore relationship semantics, which leaves the set of semantic levels incomplete; (2) they rely on feature distance measurement alone to constrain the real-valued features of different modalities to lie in the same space, which is insufficient; (3) they fail to handle the heavily imbalanced distributions of attribute labels at different semantic levels. To overcome these limitations, this paper proposes a novel multi-level cross-modal semantic alignment network (MCSAN) for video–text retrieval that jointly models video–text similarity at the global, entity, action and relationship semantic levels in a unified deep model. Specifically, both video and text are first decomposed into global, entity, action and relationship semantic levels by carefully designed spatial–temporal semantic learning structures. Then, KLDivLoss and a cross-modal parameter-sharing attribute projection layer are used as statistical constraints to ensure that representations from different modalities at different semantic levels are projected into a common semantic space. In addition, a novel focal binary cross-entropy (FBCE) loss function is presented, which is the first effort to model the imbalanced attribute distribution problem for video–text retrieval. MCSAN effectively exploits the complementary information among the four semantic levels. Extensive experiments on two challenging video–text retrieval datasets, MSR-VTT and VATEX, show the viability of the method.
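
The abstract names a focal binary cross-entropy (FBCE) loss but does not give its formula. The following is a minimal sketch only, assuming the standard focal-loss weighting (Lin et al.) applied to per-attribute binary cross-entropy; it may differ from the paper's actual definition. The function name `focal_bce` and the parameters `gamma`, `alpha` and `eps` are illustrative assumptions, not taken from the paper.

```python
import math


def focal_bce(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean focal binary cross-entropy over multi-label attribute predictions.

    ASSUMPTION: this is the generic focal-loss form, not necessarily the
    FBCE defined in the paper.

    probs   : predicted probabilities in [0, 1], one per attribute
    targets : ground-truth labels (0 or 1), same length as probs
    gamma   : focusing parameter; larger values down-weight easy examples
    alpha   : weight on positive labels (negatives get 1 - alpha)
    """
    if len(probs) != len(targets) or len(probs) == 0:
        raise ValueError("probs and targets must be non-empty and equal in length")

    total = 0.0
    for p, y in zip(probs, targets):
        # Clamp to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        # p_t is the probability assigned to the true class.
        p_t = p if y == 1 else 1.0 - p
        a_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t)^gamma shrinks the loss of well-classified attributes,
        # so rare, hard attributes contribute relatively more.
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

With `gamma=0` the modulating factor disappears and the loss reduces to class-weighted binary cross-entropy, which makes the effect of the focusing term easy to compare on imbalanced attribute labels.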

Funder

National Natural Science Foundation (NSF) of China

Anhui Provincial Natural Science Foundation

Publisher

MDPI AG

Subject

General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)

Link

https://www.mdpi.com/2227-7390/10/18/3346/pdf

Cited by 2 articles.

1. Video–text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network;Multimedia Systems;2024-01-22

2. Fine-Grained Cross-Modal Contrast Learning for Video-Text Retrieval;Lecture Notes in Computer Science;2024