Query-Guided Refinement and Dynamic Spans Network for Video Highlight Detection and Temporal Grounding in Online Information Systems

Published: 2023-10-25 Issue: 1 Volume: 19 Page: 1-20
ISSN: 1552-6283
Container-title: International Journal on Semantic Web and Information Systems
language: en

Author:

Xu Yifang¹, Sun Yunzhuo², Xie Zien¹, Zhai Benxiang¹, Jia Youyao³, Du Sidan¹

Affiliation:

1. School of Electronic Science and Engineering, Nanjing University, China

2. School of Physics and Electronics, Hubei Normal University, China

3. Gosuncn Chuanglian Technology Co., Ltd., Guangzhou, China

Abstract

With the surge in online video content, finding highlights and key video segments has garnered widespread attention. Given a textual query, video highlight detection (HD) and temporal grounding (TG) aim to predict frame-wise saliency scores for a video while concurrently locating all relevant spans. Despite recent progress in DETR-based works, these methods crudely fuse different inputs in the encoder, which limits effective cross-modal interaction. To address this limitation, the authors design QD-Net (query-guided refinement and dynamic spans network), tailored for HD&TG. Specifically, they propose a query-guided refinement module to decouple feature encoding from the interaction process. Furthermore, they present a dynamic span decoder that uses learnable 2D spans as decoder queries, which accelerates training convergence for TG. On the QVHighlights dataset, the proposed QD-Net achieves 61.87 HD-HIT@1 and 61.88 TG-mAP@0.5, improvements of +1.88 and +8.05, respectively, over the state-of-the-art method.
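
As a rough illustration of what the "@0.5" in TG-mAP@0.5 refers to, the sketch below computes the temporal intersection-over-union (IoU) between a predicted span and a ground-truth span, and counts the prediction as a match when the IoU reaches 0.5. It is not taken from the paper: the reading of a "2D span" as a (center, width) pair is an assumption, and the function names `span_to_interval`, `temporal_iou`, and `is_match` are hypothetical.

```python
def span_to_interval(center, width):
    """Convert an assumed (center, width) span into a (start, end) interval."""
    half = width / 2.0
    return center - half, center + half


def temporal_iou(a, b):
    """IoU of two (start, end) intervals on the time axis."""
    # Overlap length is clamped at zero for disjoint intervals.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """A prediction matches a ground-truth span when IoU >= threshold."""
    return temporal_iou(pred, gt) >= threshold


# Hypothetical values in seconds, chosen only for illustration.
pred = span_to_interval(center=30.0, width=20.0)  # (20.0, 40.0)
gt = (25.0, 45.0)
iou = temporal_iou(pred, gt)  # overlap 15 s, union 25 s
```

Mean average precision at this threshold then averages precision over ranked predictions that pass the match test; that aggregation step is omitted here.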

Publisher

IGI Global

Subject

Computer Networks and Communications,Information Systems

References: 52 articles (first 5 listed).

1. Video Features with Impact on User Quality of Experience

2. Efficient Local Cloud-Based Solution for Liver Cancer Detection Using Deep Learning

3. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

4. Joint Visual and Audio Learning for Video Highlight Detection

5. On Pursuit of Designing Multi-modal Transformer for Video Grounding

Cited by 1 article.

1. VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT; Applied Sciences; 2024-02-25