TPTE: Text-guided Patch Token Exploitation for Unsupervised Fine-Grained Representation Learning

Published: 2024-08-09
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Mao Shunan¹, Chen Hao¹, Wang Yaowei², Zeng Wei¹, Zhang Shiliang¹

Affiliation:

1. School of Computer Science, Peking University, China

2. Peng Cheng Laboratory, China

Abstract

Recent advances in pre-trained vision-language models have successfully boosted the performance of unsupervised image representation in many vision tasks. Most existing works focus on learning global visual features with Transformers and neglect detailed local cues, leading to suboptimal performance in fine-grained vision tasks. In this paper, we propose a text-guided patch token exploitation framework to enhance the discriminative power of unsupervised representations by exploiting more detailed local features. Our text-guided decoder extracts local features with the guidance of texts or learned prompts describing discriminative object parts. We further introduce a local-global relation distillation loss to promote the joint optimization of local and global features. The proposed method can flexibly extract either global or global-local features as the image representation. It significantly outperforms previous methods in fine-grained image retrieval and base-to-new fine-grained classification tasks. For instance, our Recall@1 surpasses that of the recent unsupervised retrieval method STML by 6.0% on the SOP dataset. The code is publicly available at https://github.com/maosnhehe/TPTE.

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3673657

References: 60 articles.

1. Baidu. [n. d.]. https://cloud.baidu.com/product/wenxinworkshop.

2. Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101–mining discriminative components with random forests. In European Conference on Computer Vision. Springer, 446–461.

3. Tom Brown et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems.

4. Shaofei Cai, Liang Li, Xinzhe Han, Shan Huang, Qi Tian, and Qingming Huang. 2023. Semantic and Correlation Disentangled Graph Convolutions for Multilabel Image Recognition. IEEE Transactions on Neural Networks and Learning Systems (2023).

5. Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European Conference on Computer Vision. Springer, 213–229.