Spatial entropy as an inductive bias for vision transformers

Published:2024-07-17 Issue:9 Volume:113 Page:6945-6975
ISSN:0885-6125
Container-title:Machine Learning
language:en
Short-container-title:Mach Learn

Author:

Peruzzo Elia, Sangineto Enver, Liu Yahui, De Nadai Marco, Bi Wei, Lepri Bruno, Sebe Nicu

Abstract

Recent work on Vision Transformers (VTs) showed that introducing a local inductive bias in the VT architecture helps reduce the number of samples necessary for training. However, the architecture modifications lead to a loss of generality of the Transformer backbone, partially contradicting the push towards the development of uniform architectures, shared, e.g., by both the Computer Vision and the Natural Language Processing areas. In this work, we propose a different and complementary direction, in which a local bias is introduced using an auxiliary self-supervised task, performed jointly with standard supervised training. Specifically, we exploit the observation that the attention maps of VTs, when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. Thus, we explicitly encourage the emergence of this spatial clustering as a form of training regularization. In more detail, we exploit the assumption that, in a given image, objects usually correspond to a few connected regions, and we propose a spatial formulation of the information entropy to quantify this object-based inductive bias. By minimizing the proposed spatial entropy, we include an additional self-supervised signal during training. Using extensive experiments, we show that the proposed regularization leads to equivalent or better results than other VT proposals which include a local bias by changing the basic Transformer architecture, and it can drastically boost the final VT accuracy when using small-to-medium training sets. The code is available at https://github.com/helia95/SAR.
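To make the idea of a spatial entropy over connected regions concrete, the following is a minimal illustrative sketch and not the authors' implementation (see the linked repository for that). It assumes a single 2D attention map given as nested lists, binarizes it at its mean value, groups the surviving cells into 4-connected regions, and computes the Shannon entropy of the region-size distribution. The function names `connected_component_sizes` and `spatial_entropy`, the mean threshold, and the 4-connectivity are all assumptions made for illustration.

```python
import math
from collections import deque


def connected_component_sizes(mask):
    """Return the sizes of the 4-connected regions of True cells in a 2D grid."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    sizes = []
    for i in range(h):
        for j in range(w):
            if not mask[i][j] or seen[i][j]:
                continue
            # Breadth-first flood fill starting from an unvisited active cell.
            seen[i][j] = True
            queue = deque([(i, j)])
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes


def spatial_entropy(attention, threshold=None):
    """Entropy of the region-size distribution of a thresholded attention map.

    Assumption: cells strictly above the threshold (the map mean by default)
    are treated as active; each connected region gets probability
    size / total_active_cells.
    """
    flat = [v for row in attention for v in row]
    if threshold is None:
        threshold = sum(flat) / len(flat)
    mask = [[v > threshold for v in row] for row in attention]
    sizes = connected_component_sizes(mask)
    total = sum(sizes)
    if total == 0:
        return 0.0
    return -sum((s / total) * math.log(s / total) for s in sizes)


# One compact 2x2 blob: a single region, so the entropy is zero.
compact = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Four isolated active cells: four equally sized regions, entropy log(4).
scattered = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

compact_entropy = spatial_entropy(compact)
scattered_entropy = spatial_entropy(scattered)
```

Under these assumptions, attention concentrated in one connected region scores lower than the same amount of attention scattered over isolated cells, which is the behavior a minimization objective of this kind would reward. The hard threshold and flood fill here are not differentiable, so a quantity used as a training loss would need a differentiable formulation.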

Funder

Università degli Studi di Trento

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s10994-024-06570-7.pdf
