Guided Graph Attention Learning for Video-Text Matching

Published:2022-06-30 Issue:2s Volume:18 Page:1-23
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Li Kunpeng¹, Liu Chang¹, Stopa Mike², Amano Jun², Fu Yun¹

Affiliation:

1. Northeastern University, Boston, Massachusetts, USA

2. Konica Minolta, San Mateo, California, USA

Abstract

As a bridge between videos and natural language, video-text matching has been an active multimedia research topic in recent years. Such cross-modal retrieval is usually achieved by learning a common embedding space where videos and text captions are directly comparable. It remains challenging because existing visual representations do not exploit semantic correlations within videos well, resulting in a mismatch with the semantic concepts contained in the corresponding text descriptions. In this article, we propose a new Guided Graph Attention Learning (GGAL) model to enhance video embedding learning by capturing important region-level semantic concepts within the spatiotemporal space. Our model builds connections between object regions and performs hierarchical graph reasoning on both frame-level and whole-video-level region graphs. During this process, global context is used to guide attention learning on this hierarchical graph topology so that the learned overall video embedding can focus on essential semantic concepts and can be better aligned with text captions. Experiments on commonly used benchmarks validate that GGAL outperforms many recent video-text retrieval methods by a clear margin. As multimedia data in dynamic environments becomes critically important, we also validate via cross-dataset evaluations that the video-text representations learned by GGAL generalize well to unseen out-of-domain data. To further investigate the interpretability of our model, we visualize the attention weights learned by GGAL models. We find that GGAL successfully focuses on key semantic concepts in the video and has complementary attention on the context parts based on different ways of building region graphs.
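
The general idea in the abstract (region features connected in a graph, attention guided by a global context, and matching in a common embedding space) can be illustrated with a minimal sketch. This is not the GGAL implementation: the function names (graph_step, guided_attention_pool), the fully connected graph, the dot-product scoring, the mean-pooled context, and the toy 3-dimensional features are all assumptions chosen for illustration, and no learned parameters are involved.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def graph_step(regions):
    """One round of message passing on a fully connected region graph (assumed topology)."""
    updated = []
    for r in regions:
        # Edge weights come from pairwise dot-product affinity, normalized per node.
        edge_weights = softmax([dot(r, other) for other in regions])
        message = weighted_sum(edge_weights, regions)
        updated.append([a + b for a, b in zip(r, message)])  # residual update
    return updated

def guided_attention_pool(regions):
    """Pool region features using attention scored against a global context vector."""
    n = len(regions)
    context = weighted_sum([1.0 / n] * n, regions)  # hypothetical context: mean of regions
    weights = softmax([dot(r, context) for r in regions])
    return weighted_sum(weights, regions), weights

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy region features: two similar regions and one outlier (made-up values).
regions = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
video_embedding, attention = guided_attention_pool(graph_step(regions))

# Toy caption embeddings assumed to live in the same space as the video embedding.
matching_caption = [1.0, 0.0, 0.0]
unrelated_caption = [0.0, 1.0, 0.0]
print("attention weights:", attention)
print("matching score:", cosine(video_embedding, matching_caption))
print("unrelated score:", cosine(video_embedding, unrelated_caption))
```

In this toy setup the two mutually similar regions dominate the global context and so receive more attention than the outlier, and retrieval reduces to ranking captions by cosine similarity against the pooled video embedding. A trained model would replace the raw dot products with learned projections.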

Funder

Konica Minolta research funding

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3538533

Cited by 1 article.

1. Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval; ACM Transactions on Multimedia Computing, Communications, and Applications; 2024-09-12