Learning General and Specific Embedding with Transformer for Few-Shot Object Detection

Published: 2024-08-28
ISSN: 0920-5691
Container-title: International Journal of Computer Vision
Language: en
Short-container-title: Int J Comput Vis

Author:

Zhang Xu, Chen Zhe, Zhang Jing, Liu Tongliang, Tao Dacheng

Abstract

Few-shot object detection (FSOD) studies how to detect novel objects with few annotated examples effectively. Recently, it has been demonstrated that decent feature embeddings, including the general feature embeddings that are more invariant to visual changes and the specific feature embeddings that are more discriminative for different object classes, are both important for FSOD. However, current methods lack appropriate mechanisms to sensibly cooperate both types of feature embeddings based on their importance to detecting objects of novel classes, which may result in sub-optimal performance. In this paper, to achieve more effective FSOD, we attempt to explicitly encode both general and specific feature embeddings using learnable tensors and apply a Transformer to help better incorporate them in FSOD according to their relations to the input object features. We thus propose a Transformer-based general and specific embedding learning (T-GSEL) method for FSOD. In T-GSEL, learnable tensors are employed in a three-stage pipeline, encoding feature embeddings in general level, intermediate level, and specific level, respectively. In each stage, we apply a Transformer to first model the relations of the corresponding embedding to input object features and then apply the estimated relations to refine the input features. Meanwhile, we further introduce cross-stage connections between embeddings of different stages to make them complement and cooperate with each other, delivering general, intermediate, and specific feature embeddings stage by stage and utilizing them together for feature refinement in FSOD. In practice, a T-GSEL module is easy to inject. Extensive empirical results further show that our proposed T-GSEL method achieves compelling FSOD performance on both PASCAL VOC and MS COCO datasets compared with other state-of-the-art approaches.
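
The three-stage refinement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single attention head without learned projections, random values standing in for the learnable tensors, and an additive cross-stage connection; the names `refine`, `make_embedding`, and `t_gsel_sketch` are hypothetical and chosen only for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def refine(features, embedding):
    """Cross-attention sketch (assumption: one head, no projections).

    Each object feature attends over the embedding slots; the attended
    summary is added back to the feature as a residual.
    """
    dim = len(embedding[0])
    refined = []
    for feat in features:
        weights = softmax([dot(feat, slot) / math.sqrt(dim) for slot in embedding])
        summary = [
            sum(w * slot[d] for w, slot in zip(weights, embedding))
            for d in range(dim)
        ]
        refined.append([x + s for x, s in zip(feat, summary)])
    return refined


def make_embedding(slots, dim, rng):
    """Random stand-in for a learnable tensor (hypothetical initialisation)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(slots)]


def t_gsel_sketch(features, stage_embeddings):
    """Run general, intermediate, and specific stages in order.

    Assumption: the cross-stage connection simply adds the previous
    stage's embedding to the current one before attention.
    """
    carried = None
    for emb in stage_embeddings:
        if carried is not None:
            emb = [[a + b for a, b in zip(row, prev)] for row, prev in zip(emb, carried)]
        features = refine(features, emb)
        carried = emb
    return features


rng = random.Random(0)
DIM, SLOTS = 8, 4
stages = [make_embedding(SLOTS, DIM, rng) for _ in range(3)]
roi_features = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(5)]
refined = t_gsel_sketch(roi_features, stages)
```

In this sketch the features keep their shape across stages, which is consistent with the abstract's remark that the module is easy to inject into an existing detector. How the actual method parameterises attention and the cross-stage connections is not stated in the abstract.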

Funder

University of Sydney

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s11263-024-02199-0.pdf

Reference: 48 articles.

1. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

2. Carion, N., Massa, F., Synnaeve, G., et al. (2020). End-to-end object detection with transformers. In European conference on computer vision, Springer, pp. 213–229.

3. Chen, H., Wang, Y., Wang, G., et al. (2018). LSTD: A low-shot transfer detector for object detection. In Proceedings of the AAAI conference on artificial intelligence.

4. Chen, Z., Zhang, J., & Tao, D. (2021). Recursive context routing for object detection. International Journal of Computer Vision, 129(1), 142–160.

5. Chen, Z., Zhang, J., & Tao, D. (2022a). Recurrent glimpse-based decoder for detection with transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5260–5269.