Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective-Reference-Cited by-同舟云学术

Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective

Published:2023-09-07 Issue:2 Volume:132 Page:392-409
ISSN:0920-5691
Container-title:International Journal of Computer Vision
language:en
Short-container-title:Int J Comput Vis

Author:

Wu Wenhao^ORCID,Sun Zhun,Song Yuxin,Wang Jingdong,Ouyang Wanli

Abstract

AbstractTransferring knowledge from pre-trained deep models for downstream tasks, particularly with limited labeled samples, is a fundamental problem in computer vision research. Recent advances in large-scale, task-agnostic vision-language pre-trained models, which are learned with billions of samples, have shed new light on this problem. In this study, we investigate how to efficiently transfer aligned visual and textual knowledge for downstream visual recognition tasks. We first revisit the role of the linear classifier in the vanilla transfer learning framework, and then propose a new paradigm where the parameters of the classifier are initialized with semantic targets from the textual encoder and remain fixed during optimization. To provide a comparison, we also initialize the classifier with knowledge from various resources. In the empirical study, we demonstrate that our paradigm improves the performance and training speed of transfer learning tasks. With only minor modifications, our approach proves effective across 17 visual datasets that span three different data domains: image, video, and 3D point cloud.

Funder

University of Sydney

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence,Computer Vision and Pattern Recognition,Software

Link

https://link.springer.com/content/pdf/10.1007/s11263-023-01876-w.pdf

Reference99 articles.

1. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). Vivit: A video vision transformer. In ICCV (pp. 6836–6846).

2. Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? In ICML, PMLR (pp. 813–824).

3. Bossard, L., Guillaumin, M., & Van Gool, L. (2014). Food-101–mining discriminative components with random forests. In ECCV.

4. Brattoli, B., Tighe, J., Zhdanov, F., Perona, P., & Chalupka, K. (2020). Rethinking zero-shot video classification: End-to-end training for realistic applications. In CVPR (pp. 4613–4623).

5. Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., & Kim, S. (2022). Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset

Cited by 3 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Pattern-Expandable Image Copy Detection;International Journal of Computer Vision;2024-06-22

2. CLIP-guided Prototype Modulating for Few-shot Action Recognition;International Journal of Computer Vision;2023-10-17

3. What Can Simple Arithmetic Operations Do for Temporal Modeling?;2023 IEEE/CVF International Conference on Computer Vision (ICCV);2023-10-01