Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning

Published:2023-02-06 Issue:2 Volume:19 Page:1-18
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Dong Shanshan¹, Niu Tianzi¹, Luo Xin¹, Liu Wu², Xu Xinshun¹

Affiliation:

1. School of Software, Shandong University, Jinan, China

2. JD AI Research, Beijing, China

Abstract

Video captioning, which bridges vision and language, is a fundamental yet challenging task in computer vision. To generate accurate and comprehensive sentences, both visual and semantic information are important. However, most existing methods simply concatenate different types of features and ignore the interactions between them. In addition, there is a large semantic gap between the visual feature space and the semantic embedding space, which makes the task harder still. To address these issues, we propose a framework named semantic embedding guided attention with Explicit visual Feature Fusion for vidEo CapTioning, EFFECT for short, in which we design an explicit visual-feature fusion (EVF) scheme to capture the pairwise interactions between multiple visual modalities and fuse multimodal visual features of videos in an explicit way. Furthermore, we propose a novel attention mechanism called semantic embedding guided attention (SEGA), which cooperates with the temporal attention to generate a joint attention map. Specifically, in SEGA, the semantic word embedding information is leveraged to guide the model to pay more attention to the most correlated visual features at each decoding stage. In this way, the semantic gap between the visual and semantic spaces is alleviated to some extent. To evaluate the proposed model, we conduct extensive experiments on two widely used datasets, i.e., MSVD and MSR-VTT. The experimental results demonstrate that our approach achieves state-of-the-art results in terms of four evaluation metrics.
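
The abstract does not give the SEGA equations, so the following is only an illustrative sketch of the general idea of a joint attention map: per-frame scores from a temporal query and from a word-embedding-derived query are combined before a single softmax. The function names, the dot-product scoring, and the additive combination are all assumptions made for illustration; they are not the authors' formulation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def joint_attention(frames, temporal_query, semantic_query):
    """Hypothetical joint attention over frame features.

    frames:          list of per-frame feature vectors
    temporal_query:  vector standing in for the decoder state (assumption)
    semantic_query:  vector standing in for a projected word embedding (assumption)
    """
    temporal_scores = [dot(f, temporal_query) for f in frames]
    semantic_scores = [dot(f, semantic_query) for f in frames]
    # Assumed combination rule: add the two score sets, then normalise once.
    weights = softmax([t + s for t, s in zip(temporal_scores, semantic_scores)])
    dim = len(frames[0])
    # Attention-weighted sum of frame features.
    context = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]
    return weights, context

# Toy inputs: three frames with two-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
temporal_query = [0.5, 0.0]
semantic_query = [0.0, 2.0]
weights, context = joint_attention(frames, temporal_query, semantic_query)
print(weights, context)
```

In this toy setting the third frame matches both queries and therefore receives the largest weight, which is the kind of behaviour a semantically guided attention map is meant to encourage.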

Funder

National Natural Science Foundation of China

Shandong Province Key Research and Development Program

Natural Science Foundation of Shandong Province

Major Program of the National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3550276

Cited by 4 articles.

1. Action-aware Linguistic Skeleton Optimization Network for Non-autoregressive Video Captioning; ACM Transactions on Multimedia Computing, Communications, and Applications; 2024-07-20

2. Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting; ACM Transactions on Multimedia Computing, Communications, and Applications; 2024-01-11

3. Video Captioning by Learning from Global Sentence and Looking Ahead; ACM Transactions on Multimedia Computing, Communications, and Applications; 2023-06-07

4. Semantic Enhanced Video Captioning with Multi-feature Fusion; ACM Transactions on Multimedia Computing, Communications, and Applications; 2023-05-30