Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective
Published: 2023-09-07
Issue: 2
Volume: 132
Pages: 392-409
ISSN: 0920-5691
Container-title: International Journal of Computer Vision
Short-container-title: Int J Comput Vis
Language: en
Author:
Wu Wenhao, Sun Zhun, Song Yuxin, Wang Jingdong, Ouyang Wanli
Abstract
Transferring knowledge from pre-trained deep models for downstream tasks, particularly with limited labeled samples, is a fundamental problem in computer vision research. Recent advances in large-scale, task-agnostic vision-language pre-trained models, which are learned with billions of samples, have shed new light on this problem. In this study, we investigate how to efficiently transfer aligned visual and textual knowledge for downstream visual recognition tasks. We first revisit the role of the linear classifier in the vanilla transfer learning framework, and then propose a new paradigm where the parameters of the classifier are initialized with semantic targets from the textual encoder and remain fixed during optimization. To provide a comparison, we also initialize the classifier with knowledge from various resources. In the empirical study, we demonstrate that our paradigm improves the performance and training speed of transfer learning tasks. With only minor modifications, our approach proves effective across 17 visual datasets that span three different data domains: image, video, and 3D point cloud.
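The following is a minimal PyTorch sketch, not the authors' code, of the paradigm the abstract describes: a linear classifier whose weights are initialized from textual-encoder class embeddings and then frozen, while only the visual backbone is trained. It assumes a CLIP-style setup in which `text_embeds` are precomputed class-name embeddings and `visual_backbone` produces features in the same embedding space; these names and the logit scale are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextInitClassifier(nn.Module):
    """Frozen classifier initialized from text embeddings (illustrative sketch)."""

    def __init__(self, visual_backbone: nn.Module, text_embeds: torch.Tensor):
        super().__init__()
        self.visual = visual_backbone  # trainable visual encoder
        num_classes, dim = text_embeds.shape
        # Linear classifier whose rows are the normalized class-name embeddings.
        self.classifier = nn.Linear(dim, num_classes, bias=False)
        with torch.no_grad():
            self.classifier.weight.copy_(F.normalize(text_embeds, dim=-1))
        # Keep the classifier fixed during optimization, per the described paradigm.
        self.classifier.weight.requires_grad_(False)
        # Temperature for cosine-similarity logits (an assumption, not from the paper).
        self.logit_scale = nn.Parameter(torch.tensor(4.6))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = F.normalize(self.visual(images), dim=-1)      # [B, dim]
        return self.logit_scale.exp() * self.classifier(feats)  # [B, num_classes]
```

In use, only the visual backbone (and the optional scale) would be handed to the optimizer, e.g. torch.optim.AdamW over the parameters with requires_grad=True, and the model trained with standard cross-entropy; the classifier never moves from its semantic initialization.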
Funder
University of Sydney
Publisher
Springer Science and Business Media LLC
Subject
Artificial Intelligence, Computer Vision and Pattern Recognition, Software
Cited by
3 articles.
1. Pattern-Expandable Image Copy Detection. International Journal of Computer Vision, 2024-06-22.
2. CLIP-guided Prototype Modulating for Few-shot Action Recognition. International Journal of Computer Vision, 2023-10-17.
3. What Can Simple Arithmetic Operations Do for Temporal Modeling? 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023-10-01.