Universal Prototype Transport for Zero-Shot Action Recognition and Localization
Published: 2023-07-19
Issue: 11
Volume: 131
Pages: 3060-3073
ISSN: 0920-5691
Container-title: International Journal of Computer Vision
Language: en
Short-container-title: Int J Comput Vis
Abstract
This work addresses the problem of recognizing action categories in videos when no training examples are available. The current state-of-the-art enables such a zero-shot recognition by learning universal mappings from videos to a semantic space, either trained on large-scale seen actions or on objects. While effective, we find that universal action and object mappings are biased to specific regions in the semantic space. These biases lead to a fundamental problem: many unseen action categories are simply never inferred during testing. For example, on UCF-101, a quarter of the unseen actions are out of reach with a state-of-the-art universal action model. To that end, this paper introduces universal prototype transport for zero-shot action recognition. The main idea is to re-position the semantic prototypes of unseen actions by matching them to the distribution of all test videos. For universal action models, we propose to match distributions through a hyperspherical optimal transport from unseen action prototypes to the set of all projected test videos. The resulting transport couplings in turn determine the target prototype for each unseen action. Rather than directly using the target prototype as final result, we re-position unseen action prototypes along the geodesic spanned by the original and target prototypes as a form of semantic regularization. For universal object models, we outline a variant that defines target prototypes based on an optimal transport between unseen action prototypes and object prototypes. Empirically, we show that universal prototype transport diminishes the biased selection of unseen action prototypes and boosts both universal action and object models for zero-shot classification and spatio-temporal localization.
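The abstract outlines a concrete procedure: compute a hyperspherical optimal transport between unseen action prototypes and the projected test videos, read off a target prototype for each action from the transport coupling, and re-position each prototype along the geodesic between its original and target positions. The following is a minimal NumPy sketch of that pipeline, not the authors' implementation: it uses entropic (Sinkhorn) regularization for the transport, a coupling-weighted barycentre as the target prototype, and spherical interpolation for the geodesic step. The marginals, regularization strength, interpolation factor, and array sizes are illustrative assumptions.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.1, n_iters=200):
    """Entropic-regularized optimal transport via Sinkhorn iterations.

    a: (n,) marginal over unseen action prototypes
    b: (m,) marginal over projected test videos
    C: (n, m) cost matrix (here: arc-cosine distances on the hypersphere)
    Returns the (n, m) transport coupling.
    """
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def slerp(p, q, t):
    """Spherical interpolation between unit vectors p and q (geodesic on the hypersphere)."""
    omega = np.arccos(np.clip(p @ q, -1.0, 1.0))
    if omega < 1e-8:
        return p
    return (np.sin((1 - t) * omega) * p + np.sin(t * omega) * q) / np.sin(omega)

# Toy setup: all sizes are illustrative, not the paper's settings.
rng = np.random.default_rng(0)
d, n_unseen, n_videos = 300, 20, 500

# Unit-normalized semantic prototypes of unseen actions and projected test videos.
P = rng.normal(size=(n_unseen, d)); P /= np.linalg.norm(P, axis=1, keepdims=True)
V = rng.normal(size=(n_videos, d)); V /= np.linalg.norm(V, axis=1, keepdims=True)

# Hyperspherical cost: geodesic (arc-cosine) distance between prototypes and videos.
C = np.arccos(np.clip(P @ V.T, -1.0, 1.0))

# Uniform marginals; solve for the transport coupling.
T = sinkhorn(np.full(n_unseen, 1 / n_unseen), np.full(n_videos, 1 / n_videos), C)

# Target prototype per unseen action: coupling-weighted mean of matched test videos,
# re-projected onto the unit hypersphere.
targets = T @ V / T.sum(axis=1, keepdims=True)
targets /= np.linalg.norm(targets, axis=1, keepdims=True)

# Semantic regularization: move part-way along the geodesic from the original
# prototype to its target (t=0 keeps the original, t=1 uses the target).
t = 0.5
P_new = np.stack([slerp(P[i], targets[i], t) for i in range(n_unseen)])
```

At test time, videos would then be assigned to the unseen action whose re-positioned prototype P_new is nearest (e.g., by cosine similarity), which is the step the biased original prototypes would otherwise distort.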
Publisher
Springer Science and Business Media LLC
Subject
Artificial Intelligence,Computer Vision and Pattern Recognition,Software
Cited by
2 articles.
1. Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection; 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW); 2023-10-02
2. ReGen: A good Generative zero-shot video classifier should be Rewarded; 2023 IEEE/CVF International Conference on Computer Vision (ICCV); 2023-10-01