Transform-Equivariant Consistency Learning for Temporal Sentence Grounding

Published:2024-01-11 Issue:4 Volume:20 Page:1-19
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Liu Daizong¹, Qu Xiaoye², Dong Jianfeng³, Zhou Pan², Xu Zichuan⁴, Wang Haozhao², Di Xing⁵, Lu Weining⁶, Cheng Yu⁷

Affiliation:

1. Peking University, China

2. Huazhong University of Science and Technology, China

3. Zhejiang Gongshang University, China

4. Dalian University of Technology, China

5. Protagolabs Inc., USA

6. Tsinghua University, China

7. The Chinese University of Hong Kong, China

Abstract

This paper addresses temporal sentence grounding (TSG). Although existing methods have achieved decent results on this task, they rely heavily on abundant paired video-query data for training and are prone to dataset distribution bias. To alleviate these limitations, we introduce a novel Equivariant Consistency Regulation Learning (ECRL) framework that learns more discriminative query-related frame-wise representations for each video in a self-supervised manner. Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently under various video-level transformations. Concretely, we first design a series of spatio-temporal augmentations on both foreground and background video segments to generate a set of synthetic video samples. In particular, we devise a self-refine module to enhance the completeness and smoothness of the augmented video. Then, we present a novel self-supervised consistency loss (SSCL), applied to the original and augmented videos, that captures their invariant query-related semantics by minimizing the KL-divergence between the sequence similarity of the two videos and a prior Gaussian distribution of timestamp distance. Finally, a shared grounding head is introduced to predict the transform-equivariant query-guided segment boundaries for both the original and augmented videos. Extensive experiments on three challenging datasets (ActivityNet, TACoS, and Charades-STA) demonstrate both the effectiveness and the efficiency of the proposed ECRL framework.
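
The consistency loss described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the direction of the KL-divergence, or any hyperparameters. The sketch assumes cosine similarity between frame features, a softmax with temperature `tau`, a Gaussian prior of width `sigma` over timestamp distance, and the direction KL(prior || similarity). The function names `gaussian_prior` and `consistency_loss` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gaussian_prior(t_orig, t_aug, sigma=1.0):
    # Row i is a distribution over augmented frames that peaks where the
    # augmented timestamp is closest to original timestamp i.
    # (sigma is an assumed hyperparameter, not taken from the paper.)
    d2 = (t_orig[:, None] - t_aug[None, :]) ** 2
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum(axis=1, keepdims=True)


def consistency_loss(f_orig, f_aug, t_orig, t_aug, sigma=1.0, tau=0.1):
    # f_orig: (T1, D) frame features of the original video.
    # f_aug:  (T2, D) frame features of the augmented video.
    # t_orig, t_aug: timestamps (in the original video's time axis) of each frame.
    a = f_orig / np.linalg.norm(f_orig, axis=1, keepdims=True)
    b = f_aug / np.linalg.norm(f_aug, axis=1, keepdims=True)
    # Assumed similarity: temperature-scaled cosine similarity, softmax per row.
    sim = softmax(a @ b.T / tau, axis=1)
    prior = gaussian_prior(t_orig, t_aug, sigma)
    eps = 1e-8  # avoids log(0)
    # Assumed direction: KL(prior || sim), averaged over original frames.
    kl = (prior * (np.log(prior + eps) - np.log(sim + eps))).sum(axis=1)
    return kl.mean()
```

Under these assumptions, the loss is smaller when frames of the two videos that are close in time also have similar features, which is the invariance the abstract refers to. For example, calling `consistency_loss(f, f, t, t)` on a video paired with itself yields a lower value than pairing it with a time-reversed copy `f[::-1]` while keeping the same timestamps.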

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3634749

Cited by 1 article.

1. Parameterized multi-perspective graph learning network for temporal sentence grounding in videos; Applied Intelligence; 2024-06-24