Boundary Proposal Network for Two-stage Natural Language Video Localization

Published:2021-05-18 Issue:4 Volume:35 Page:2986-2994
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title:AAAI

Author:

Xiao Shaoning, Chen Long, Zhang Songyang, Ji Wei, Shao Jian, Ye Lu, Xiao Jun

Abstract

We address Natural Language Video Localization (NLVL): localizing the video segment that corresponds to a natural language description in a long, untrimmed video. Almost all state-of-the-art NLVL methods are one-stage and typically fall into two categories: 1) anchor-based approaches, which first pre-define a series of candidate video segments (e.g., by sliding window) and then classify each candidate; and 2) anchor-free approaches, which directly predict, for each video frame, the probability that it is a boundary frame or an intermediate frame inside the positive segment. However, both kinds of one-stage approaches have inherent drawbacks: the anchor-based approach depends on heuristic rules, which limits its ability to handle videos of varying length, while the anchor-free approach fails to exploit segment-level interaction and therefore achieves inferior results. In this paper, we propose the Boundary Proposal Network (BPNet), a universal two-stage framework that avoids both issues. Specifically, in the first stage, BPNet uses an anchor-free model to generate a group of high-quality candidate video segments with their boundaries. In the second stage, a visual-language fusion layer jointly models the multi-modal interaction between each candidate and the language query, followed by a matching score rating layer that outputs an alignment score for each candidate. We evaluate BPNet on three challenging NLVL benchmarks (Charades-STA, TACoS and ActivityNet-Captions). Extensive experiments and ablation studies on these datasets demonstrate that BPNet outperforms the state-of-the-art methods.
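
The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the product-of-probabilities proposal score, and the mean pooling with cosine similarity (used here in place of the learned visual-language fusion and matching score rating layers) are all assumptions made for illustration, as are the toy inputs.

```python
import math
from typing import List, Tuple

Segment = Tuple[int, int]  # inclusive (start_frame, end_frame)


def propose_segments(start_prob: List[float], end_prob: List[float],
                     top_k: int) -> List[Segment]:
    """Stage 1 (assumed scoring rule): rank every (start, end) pair with
    start <= end by the product of its boundary probabilities, keep top_k."""
    scored = [
        (start_prob[s] * end_prob[e], (s, e))
        for s in range(len(start_prob))
        for e in range(s, len(end_prob))
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [segment for _, segment in scored[:top_k]]


def mean_pool(frames: List[List[float]], segment: Segment) -> List[float]:
    """Average the frame features inside a candidate segment."""
    start, end = segment
    span = frames[start:end + 1]
    return [sum(frame[d] for frame in span) / len(span)
            for d in range(len(span[0]))]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def rank_candidates(frames: List[List[float]], query: List[float],
                    candidates: List[Segment]) -> List[Tuple[Segment, float]]:
    """Stage 2 (stand-in for the learned layers): score each candidate
    against the query vector and sort by descending alignment score."""
    scored = [(seg, cosine(mean_pool(frames, seg), query)) for seg in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy inputs (hypothetical): six frames with 2-d features; frames 2-4
# resemble the query, the remaining frames do not.
frame_features = [[0.0, 1.0], [0.0, 1.0], [0.6, 0.8],
                  [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query_vector = [1.0, 0.0]
start_prob = [0.05, 0.10, 0.70, 0.10, 0.03, 0.02]  # stage-1 start scores
end_prob = [0.02, 0.03, 0.10, 0.15, 0.60, 0.05]    # stage-1 end scores

candidates = propose_segments(start_prob, end_prob, top_k=3)
ranked = rank_candidates(frame_features, query_vector, candidates)
best_segment, best_score = ranked[0]
print(candidates, best_segment)
```

In this sketch the first stage only narrows the search to a few boundary-driven candidates, and the final choice is made at the segment level, which is the division of labour the abstract attributes to the two stages.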

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 66 articles.

1. Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-09-12

2. Fuzzy Multimodal Graph Reasoning for Human-Centric Instructional Video Grounding;IEEE Transactions on Fuzzy Systems;2024-09

3. Event-Oriented State Alignment Network for Weakly Supervised Temporal Language Grounding;Entropy;2024-08-27

4. Routing Evidence for Unseen Actions in Video Moment Retrieval;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24

5. Weakly Supervised Video Moment Retrieval via Location-irrelevant Proposal Learning;Companion Proceedings of the ACM Web Conference 2024;2024-05-13