Rethinking the Bottom-Up Framework for Query-Based Video Localization

Published: 2020-04-03 Issue: 07 Volume: 34 Pages: 10551-10558
ISSN: 2374-3468
Container-title: Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title: AAAI

Author:

Chen Long, Lu Chujie, Tang Siliang, Xiao Jun, Zhang Dong, Tan Chilie, Li Xiaolin

Abstract

In this paper, we focus on the task of query-based video localization, i.e., localizing a query in a long, untrimmed video. The prevailing solutions to this problem fall into two categories: i) Top-down approach: it pre-cuts the video into a set of moment candidates, then performs classification and regression for each candidate; ii) Bottom-up approach: it injects the whole query content into each video frame, then predicts the probability of each frame being a ground-truth segment boundary (i.e., start or end). Both frameworks have shortcomings: top-down models are computationally heavy and sensitive to heuristic rules, while bottom-up models have so far performed worse than their top-down counterparts. However, we argue that the performance of the bottom-up framework is severely underestimated because of unreasonable current designs in both the backbone and the head network. To this end, we design a novel bottom-up model: Graph-FPN with Dense Predictions (GDP). For the backbone, GDP first generates a frame feature pyramid to capture multi-level semantics, then uses graph convolution to encode the rich scene relationships, which incidentally mitigates the semantic gaps in the multi-scale feature pyramid. For the head network, GDP regards all frames falling within the ground-truth segment as foreground, and each foreground frame regresses the distances from its own location to the two boundaries (start and end). Extensive experiments on two query-based video localization tasks (natural language video localization and video relocalization), involving four challenging benchmarks (TACoS, Charades-STA, ActivityNet Captions, and Activity-VRL), show that GDP surpasses the state-of-the-art top-down models.
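
The dense-prediction idea in the head network can be illustrated with a small sketch. It is not the authors' implementation: the function names, the use of inclusive integer frame indices, and the `None` placeholder for background frames are all assumptions made for illustration. It only shows how each foreground frame could be assigned its distances to the start and end boundaries, and how a single frame's prediction maps back to a segment.

```python
def dense_boundary_targets(num_frames, start, end):
    """Build per-frame targets: (is_foreground, dist_to_start, dist_to_end).

    Assumption: `start` and `end` are inclusive frame indices of the
    ground-truth segment. Background frames get None for both distances.
    """
    if not 0 <= start <= end < num_frames:
        raise ValueError("segment must lie inside the video")
    targets = []
    for t in range(num_frames):
        if start <= t <= end:
            # Foreground frame: regress the distance to each boundary.
            targets.append((True, t - start, end - t))
        else:
            # Background frame: no regression target.
            targets.append((False, None, None))
    return targets


def decode_segment(t, dist_to_start, dist_to_end):
    """Recover the (start, end) segment implied by frame t's two distances."""
    return (t - dist_to_start, t + dist_to_end)


# Hypothetical 8-frame video whose ground-truth segment covers frames 2..5.
targets = dense_boundary_targets(8, 2, 5)
```

Under these assumptions, every frame inside the segment yields a training sample, whereas a boundary-classification head would supervise only the start and end frames.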

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 60 articles.

1. Context-aware relational reasoning for video chunks and frames overlapping in language-based moment localization;Neurocomputing;2024-10

2. Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-09-12

3. Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement;Proceedings of the 2024 International Conference on Multimedia Retrieval;2024-05-30

4. Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning;Proceedings of the 2024 International Conference on Multimedia Retrieval;2024-05-30

5. Dual-path temporal map optimization for make-up temporal video grounding;Multimedia Systems;2024-05-03