Scenario-Aware Recurrent Transformer for Goal-Directed Video Captioning

Published: 2022-03-04 Issue: 4 Volume: 18 Page: 1-17
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Man Xin¹, Ouyang Deqiang², Li Xiangpeng¹, Song Jingkuan¹, Shao Jie¹

Affiliation:

1. Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China

2. College of Computer Science, Chongqing University, China and Intelligent Terminal Key Laboratory of Sichuan Province, Yibin, China

Abstract

Fully mining visual cues to aid in content understanding is crucial for video captioning. However, most state-of-the-art video captioning methods are limited to generating captions purely based on straightforward information while ignoring the scenario and context information. To fill the gap, we propose a novel, simple but effective scenario-aware recurrent transformer (SART) model to execute video captioning. Our model contains a “scenario understanding” module to obtain a global perspective across multiple frames, providing a specific scenario to guarantee a goal-directed description. Moreover, for the sake of achieving narrative continuity in the generated paragraph, a unified recurrent transformer is adopted. To demonstrate the effectiveness of our proposed SART, we have conducted comprehensive experiments on various large-scale video description datasets, including ActivityNet, YouCookII, and VideoStory. Additionally, we extend a story-oriented evaluation framework for assessing the quality of the generated caption more precisely. The superior performance shows that SART has a strong ability to generate correct, deliberative, and narratively coherent video descriptions.

Funder

National Natural Science Foundation of China

Open Fund of Intelligent Terminal Key Laboratory of Sichuan Province

Zhejiang Lab’s International Talent Fund for Young Professionals

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3503927

References: 53 articles.

1. Video Description

2. A Knowledge-Grounded Multimodal Search-Based Conversational Agent

3. LaSO: Label-Set Operations Networks for Multi-Label Few-Shot Learning

4. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.

5. Video Captioning with Guidance of Multimodal Latent Topics

Cited by 27 articles.

1. GG-LLM: Geometrically Grounding Large Language Models for Zero-shot Human Activity Forecasting in Human-Aware Task Planning;2024 IEEE International Conference on Robotics and Automation (ICRA);2024-05-13

2. Continuous Image Outpainting with Neural ODE;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-04-25

3. Deep Multimodal Data Fusion;ACM Computing Surveys;2024-04-24

4. Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-01-11

5. A Structure-Preserving and Illumination-Consistent Cycle Framework for Image Harmonization;IEEE Transactions on Multimedia;2024