Affiliation:
1. Hunan University, Changsha, China
2. CVTE Inc., Guangzhou, Guangdong, China
Abstract
The emerging task of language-based video moment retrieval aims to retrieve a target moment from an untrimmed video given a natural language query. It is more applicable in practice than traditional whole-video retrieval because it can accurately localize a specific video moment. In this work, we propose a novel solution that investigates language-based video moment retrieval under adversarial learning. The key to our solution is to formulate the task as an adversarial learning problem with two tightly connected components. Specifically, reinforcement learning is employed as a generator to produce a set of candidate video moments. Meanwhile, multi-task learning is utilized as a discriminator, which integrates inter-modal and intra-modal learning in a unified framework by employing a sequential update strategy. Finally, the generator and the discriminator reinforce each other during adversarial learning, which jointly optimizes the performance of both video moment ranking and video moment localization. Extensive experimental results on two challenging benchmarks, Charades-STA and TACoS, demonstrate the effectiveness and rationality of the proposed solution. In addition, on the larger and unbiased datasets ActivityNet Captions and ActivityNet-CD, the proposed framework exhibits excellent robustness.
Funder
National Natural Science Foundation of China
Natural Science Foundation of Hunan Province
National Key Research and Development Project of China
Science and Technology Key Projects of Hunan Province
Special Funds for the Construction of Innovative Provinces in Hunan Province of China
Science and Technology Project of Changsha City
Fundamental Research Funds for the Central Universities
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
Cited by
15 articles.
1. Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-09-12
2. Backdoor Two-Stream Video Models on Federated Learning;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-09-12
3. Routing Evidence for Unseen Actions in Video Moment Retrieval;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24
4. Contrastive topic-enhanced network for video captioning;Expert Systems with Applications;2024-03
5. Transform-Equivariant Consistency Learning for Temporal Sentence Grounding;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-01-11