Affiliation:
1. Center for Future Media, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
2. College of Computer Science, Chongqing University, Chongqing, China, and Intelligent Terminal Key Laboratory of Sichuan Province, Yibin, China
Abstract
Fully mining visual cues to aid content understanding is crucial for video captioning. However, most state-of-the-art video captioning methods generate captions purely from straightforward frame-level information while ignoring scenario and context information. To fill this gap, we propose a novel, simple yet effective scenario-aware recurrent transformer (SART) model for video captioning. Our model contains a “scenario understanding” module that obtains a global perspective across multiple frames, providing a specific scenario to guarantee a goal-directed description. Moreover, to achieve narrative continuity in the generated paragraph, a unified recurrent transformer is adopted. To demonstrate the effectiveness of the proposed SART, we conduct comprehensive experiments on several large-scale video description datasets, including ActivityNet, YouCookII, and VideoStory. Additionally, we extend a story-oriented evaluation framework to assess the quality of the generated captions more precisely. The superior performance shows that SART has a strong ability to generate correct, deliberative, and narratively coherent video descriptions.
Funder
National Natural Science Foundation of China
Open Fund of Intelligent Terminal Key Laboratory of Sichuan Province
Zhejiang Lab’s International Talent Fund for Young Professionals
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
References: 53 articles.
1. Video Description
2. A Knowledge-Grounded Multimodal Search-Based Conversational Agent
3. LaSO: Label-Set Operations Networks for Multi-Label Few-Shot Learning
4. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
5. Video Captioning with Guidance of Multimodal Latent Topics
Cited by: 27 articles.