Multimodal few-shot classification without attribute embedding

Published: 2024-01-10 Issue: 1 Volume: 2024
ISSN: 1687-5281
Container-title: EURASIP Journal on Image and Video Processing
Language: en
Short-container-title: J Image Video Proc.

Author:

Chang Jun Qing, Rajan Deepu, Vun Nicholas

Abstract

Multimodal few-shot learning aims to exploit the complementary information inherent in multiple modalities for vision tasks in low-data scenarios. Most current research focuses on finding a suitable embedding space for the various modalities. While embedding-based solutions provide state-of-the-art results, they reduce the interpretability of the model. Separate visualization approaches can make such models more transparent. In this paper, a multimodal few-shot learning framework that is inherently interpretable is presented. This is achieved by using the textual modality in the form of attributes without embedding them, which enables the model to directly explain which attributes caused it to classify an image into a particular class. The model consists of a variational autoencoder that learns the visual latent representation, combined with a semantic latent representation learnt by a standard autoencoder, for which a semantic loss is calculated between the latent representation and a binary attribute vector. A decoder reconstructs the original image from the concatenated latent vectors. The proposed model outperforms other multimodal methods when all test classes are used, e.g., 50 classes in a 50-way 1-shot setting, and is comparable for smaller numbers of ways. Since raw text attributes are used, the datasets for evaluation are CUB, SUN and AWA2. The effectiveness of the interpretability provided by the model is evaluated by analyzing how well it has learnt to identify the attributes.
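The architecture described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, because the abstract does not give layer types, sizes, or the exact loss formulas. The sketch assumes single linear layers in place of real encoders and decoders, a binary cross-entropy as the semantic loss, and one sigmoid unit per attribute as the semantic latent. All names (`forward_losses`, `params`, `attribute_names`) and the attribute labels are hypothetical.

```python
import math
import random


def linear(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def forward_losses(image, attributes, params, rng):
    """Return the three loss terms and the predicted attribute probabilities."""
    # Visual branch (VAE): encode to mean and log-variance, then sample
    # with the reparameterization trick.
    mu = linear(params["w_mu"], image)
    logvar = linear(params["w_logvar"], image)
    z_visual = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mu, logvar))

    # Semantic branch: one latent unit per attribute (assumption).
    attr_probs = [sigmoid(z) for z in linear(params["w_sem"], image)]
    # Assumed semantic loss: binary cross-entropy against the attribute vector.
    eps = 1e-7
    semantic = -sum(a * math.log(p + eps) + (1 - a) * math.log(1.0 - p + eps)
                    for a, p in zip(attributes, attr_probs))

    # The decoder reconstructs the image from the concatenated latent vectors.
    recon = linear(params["w_dec"], z_visual + attr_probs)
    reconstruction = sum((r - x) ** 2 for r, x in zip(recon, image))

    losses = {"reconstruction": reconstruction, "kl": kl, "semantic": semantic}
    return losses, attr_probs


# Toy example: an 8-value "image", a 3-dimensional visual latent, 4 attributes.
rng = random.Random(0)
attribute_names = ["striped", "has_wings", "red", "aquatic"]  # hypothetical
params = {
    "w_mu": random_matrix(3, 8, rng),
    "w_logvar": random_matrix(3, 8, rng),
    "w_sem": random_matrix(4, 8, rng),
    "w_dec": random_matrix(8, 3 + 4, rng),
}
image = [rng.random() for _ in range(8)]
attributes = [1, 0, 1, 0]

losses, attr_probs = forward_losses(image, attributes, params, rng)
# Because the semantic latent is not embedded, it can be read off directly.
active = [name for name, p in zip(attribute_names, attr_probs) if p > 0.5]
```

Keeping one semantic latent unit per attribute is what makes this sketch readable as an explanation: the list `active` names the attributes the untrained toy model currently predicts, without any separate visualization step.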

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s13640-024-00620-9.pdf

References (40 articles)

1. C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 1126–1135 (2017)

2. Y. Tian, Y. Wang, D. Krishnan, J.B. Tenenbaum, P. Isola, Rethinking few-shot image classification: a good embedding is all you need? In: Proceedings of European Conference on Computer Vision, pp. 266–282 (2020)

3. A. Antoniou, H. Edwards, A. Storkey, How to train your MAML. In: International Conference on Learning Representations (2019)

4. C. Finn, K. Xu, S. Levine, Probabilistic model-agnostic meta-learning. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18, pp. 9537–9548 (2018)

5. F. Pahde, M.M. Puscas, J. Wolff, T. Klein, N. Sebe, M. Nabi, Low-shot learning from imaginary 3D model. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 978–985 (2019)