An Efficient Framework for Dense Video Captioning

Published:2020-04-03 Issue:07 Volume:34 Page:12039-12046
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title:AAAI

Author:

Suin Maitreya, Rajagopalan A. N.

Abstract

Dense video captioning is an extremely challenging task, since an accurate and faithful description of events in a video requires holistic knowledge of the video contents as well as contextual reasoning about individual events. Most existing approaches handle this problem by first proposing event boundaries from a video and then captioning a subset of the proposals. Generating dense temporal annotations and corresponding captions from long videos can be extremely resource-consuming. In this paper, we focus on the task of generating a dense description of temporally untrimmed videos and aim to significantly reduce the computational cost by processing fewer frames while maintaining accuracy. Existing video captioning methods either sample frames at a predefined frequency over the entire video or use all the frames. Instead, we propose a deep reinforcement learning-based approach that enables an agent to describe multiple events in a video by watching only a portion of the frames. The agent needs to watch more frames when it is processing an informative part of the video, and skip frames when there is redundancy. The agent is trained using an actor-critic algorithm, where the actor determines the frames to be watched from a video and the critic assesses the optimality of the decisions taken by the actor. Such efficient frame selection simplifies the event proposal task considerably. This has the added effect of reducing the occurrence of unwanted proposals. The encoded state representation of the frame selection agent is further utilized for guiding the event proposal and caption generation tasks. We also leverage knowledge distillation to improve accuracy. We conduct extensive evaluations on the ActivityNet Captions dataset to validate our method.
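
As a rough illustration of the actor-critic idea mentioned in the abstract, the sketch below shows a single generic one-step actor-critic update on tabular parameters. The abstract does not specify the state, reward, or update rule used in the paper, so everything here is an assumption made for illustration: the two toy states (redundant vs. informative segment), the two actions (watch the next frame vs. skip ahead), the reward of 1.0, and the helper names `softmax` and `actor_critic_step` are all hypothetical, and the paper's agent is a deep network rather than a table.

```python
import math


def softmax(prefs):
    """Turn a list of action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(prefs, values, s, a, r, s_next,
                      gamma=0.9, lr_actor=0.1, lr_critic=0.1):
    """Apply one one-step actor-critic update in place; return the TD error.

    prefs[s][a] are the actor's action preferences (softmax policy),
    values[s] is the critic's estimate of the state value.
    """
    # Critic's judgment of the action: was the outcome better than expected?
    td_error = r + gamma * values[s_next] - values[s]
    probs = softmax(prefs[s])  # policy before the update
    # Critic update: move the value estimate toward the TD target.
    values[s] += lr_critic * td_error
    # Actor update: gradient of log softmax policy, scaled by the TD error.
    for b in range(len(prefs[s])):
        grad = (1.0 if b == a else 0.0) - probs[b]
        prefs[s][b] += lr_actor * td_error * grad
    return td_error


# Hypothetical toy setup (not from the paper):
# states:  0 = redundant segment, 1 = informative segment
# actions: 0 = watch the next frame, 1 = skip ahead several frames
prefs = [[0.0, 0.0], [0.0, 0.0]]
values = [0.0, 0.0]

# The agent watched a frame in an informative segment and got reward 1.0.
td = actor_critic_step(prefs, values, s=1, a=0, r=1.0, s_next=1)
print(td, values, softmax(prefs[1]))
```

After this single update, the policy in the informative state leans slightly toward watching the next frame, while the redundant state is unchanged. A real frame-selection agent would repeat such updates over many videos with a learned state encoding.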

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 25 articles.

1. A dense video caption dataset of student classroom behaviors and a baseline model with boundary semantic awareness; Displays; 2024-09

2. Custom CNN-BiLSTM model for video captioning; Multimedia Tools and Applications; 2024-07-03

3. TopicDVC: Dense Video Captioning with Topic Guidance; 2024 IEEE 10th International Conference on Edge Computing and Scalable Cloud (EdgeCom); 2024-06-28

4. Dense Video Captioning Based on Memory Enhanced Attention and Guided Learning; 2023 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML); 2023-11-03

5. Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods; Engineering Reports; 2023-10-12