Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

Published: 2019-07-17 Volume: 33 Page: 8393-8400
ISSN: 2374-3468
Container-title: Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title: AAAI

Author:

He Dongliang, Zhao Xiang, Huang Jizhou, Li Fu, Liu Xiao, Wen Shilei

Abstract

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding a window over the entire video or exhaustively ranking all possible clip-sentence pairs in a presegmented video, both of which inevitably suffer from having to enumerate candidates exhaustively. To alleviate this problem, we formulate this task as a sequential decision-making problem and learn an agent that progressively adjusts the temporal grounding boundaries according to its policy. Specifically, we propose a reinforcement-learning-based framework improved by multi-task learning, which shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet’18 DenseCaption dataset (Krishna et al. 2017) and the Charades-STA dataset (Sigurdsson et al. 2016; Gao et al. 2017) while observing only 10 or fewer clips per video.
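The sequential formulation in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The action names, the fixed step size, the normalized `[0, 1]` timeline, the initial window, and the `policy` callable are all hypothetical choices made for illustration. Temporal IoU is included only as a common way to score a predicted window against a ground-truth one. The `max_steps=10` default loosely mirrors the abstract's budget of 10 or fewer observed clips.

```python
from typing import Callable, Tuple

Window = Tuple[float, float]  # (start, end) on a timeline normalized to [0, 1]

# Hypothetical action set: the abstract does not list the agent's actions.
# Each entry is (direction for start, direction for end).
ACTIONS = {
    "start_later": (+1, 0),
    "start_earlier": (-1, 0),
    "end_later": (0, +1),
    "end_earlier": (0, -1),
}


def temporal_iou(a: Window, b: Window) -> float:
    """Intersection over union of two temporal windows."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def apply_action(window: Window, action: str, step: float = 0.05) -> Window:
    """Move one boundary by `step` (an assumed value), clamped to [0, 1]."""
    d_start, d_end = ACTIONS[action]
    start = min(max(window[0] + d_start * step, 0.0), 1.0)
    end = min(max(window[1] + d_end * step, 0.0), 1.0)
    # Ignore moves that would produce an empty or inverted window.
    return (start, end) if end > start else window


def ground(policy: Callable[[Window], str],
           window: Window = (0.25, 0.75),
           max_steps: int = 10) -> Window:
    """Let `policy` adjust the window until it says "stop" or the budget ends."""
    for _ in range(max_steps):
        action = policy(window)
        if action == "stop":
            break
        window = apply_action(window, action)
    return window
```

In a trained system, `policy` would be a learned network conditioned on the sentence and on video features. Here it is any function from the current window to an action name, which keeps the sketch self-contained.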

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 57 articles.

1. Triadic temporal-semantic alignment for weakly-supervised video moment retrieval;Pattern Recognition;2024-12

2. Context-aware relational reasoning for video chunks and frames overlapping in language-based moment localization;Neurocomputing;2024-10

3. Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-09-12

4. SgLFT: Semantic-guided Late Fusion Transformer for video corpus moment retrieval;Neurocomputing;2024-09

5. Weakly Supervised Video Moment Retrieval via Location-irrelevant Proposal Learning;Companion Proceedings of the ACM Web Conference 2024;2024-05-13