Moment is Important: Language-Based Video Moment Retrieval via Adversarial Learning

Published:2022-02-16 Issue:2 Volume:18 Page:1-21
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Zeng Yawen¹, Cao Da¹, Lu Shaofei¹, Zhang Hanling¹, Xu Jiao², Qin Zheng¹

Affiliation:

1. Hunan University, Changsha, China

2. CVTE Inc., Guangzhou, Guangdong, China

Abstract

The emerging task of language-based video moment retrieval aims to retrieve a target moment from an untrimmed video given a natural language query. Compared with traditional whole-video retrieval, it is more applicable in practice because it accurately localizes a specific video moment. In this work, we propose a novel solution that investigates language-based video moment retrieval under adversarial learning. The key to our solution is to formulate the task as an adversarial learning problem with two tightly connected components. Specifically, reinforcement learning is employed as a generator to produce a set of possible video moments. Meanwhile, multi-task learning is used as a discriminator, which integrates inter-modal and intra-modal learning in a unified framework through a sequential update strategy. The generator and the discriminator are then mutually reinforced during adversarial learning, which jointly optimizes both video moment ranking and video moment localization. Extensive experiments on two challenging benchmarks, Charades-STA and TACoS, demonstrate the effectiveness and rationality of the proposed solution. On the larger and unbiased datasets ActivityNet Captions and ActivityNet-CD, the proposed framework also exhibits excellent robustness.

Funder

National Natural Science Foundation of China

Natural Science Foundation of Hunan Province

National Key Research and Development Project of China

Science and Technology Key Projects of Hunan Province

Special Funds for the Construction of Innovative Provinces in Hunan Province of China

Science and Technology Project of Changsha City

Fundamental Research Funds for the Central Universities

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3478025

References: 71 articles.

1. Localizing Moments in Video with Natural Language

2. Attentive Group Recommendation

3. Social-enhanced attentive group recommendation; Cao Da; IEEE Transactions on Knowledge and Data Engineering, 2019

4. Video-Based Cross-Modal Recipe Retrieval

5. STRONG: Spatio-Temporal Reinforcement Learning for Cross-Modal Video Moment Localization

Cited by 15 articles.

1. Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-09-12

2. Backdoor Two-Stream Video Models on Federated Learning;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-09-12

3. Routing Evidence for Unseen Actions in Video Moment Retrieval;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24

4. Contrastive topic-enhanced network for video captioning;Expert Systems with Applications;2024-03

5. Transform-Equivariant Consistency Learning for Temporal Sentence Grounding;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-01-11