Abstract
Deep neural networks have been remarkably successful as models of the primate visual system. One crucial problem is that they fail to account for the strong shape-dependence of primate vision. Whereas humans base their judgements of category membership to a large extent on shape, deep networks rely much more strongly on other features such as color and texture. While this problem has been widely documented, the underlying reasons remain unclear. We design simple, artificial image datasets in which shape, color, and texture features can be used to predict the image class. By training networks to classify images with single features and feature combinations, we show that some network architectures are unable to learn to use shape features, whereas others are able to use shape in principle but are biased towards the other features. We show that the bias can be explained by the interactions between the weight updates for many images in mini-batch gradient descent. This suggests that different learning algorithms with sparser, more local weight changes are required to make networks more sensitive to shape and improve their capability to describe human vision.

Author summary

When humans recognize objects, the cue they rely on most is shape. In contrast, deep neural networks mostly use local features like color and texture to classify images. We investigated how this difference arises, using images of simple shapes like rectangles and the letters L and T, combined with color and texture features. By testing different feature combinations, we show that some networks are generally unable to learn about shape, whereas others can learn to recognize shapes in isolation but ignore shape if another feature is present. We show that this bias for color and texture arises from the way in which networks are trained: by averaging the learning signal over many images, the training algorithm favors simple features that are relatively similar in many images and removes sparser, more varied shape features. These insights can help build networks that are more sensitive to shape and work better as models of human vision.
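The averaging argument above can be illustrated with a minimal toy sketch. This is not the paper's actual model: the gradient vectors below are hypothetical stand-ins, chosen only to show that per-image gradients which point in similar directions (like those driven by a consistent color or texture cue) survive mini-batch averaging, while equally large but more varied per-image gradients (like those driven by diverse shape cues) largely cancel.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images = 256  # hypothetical mini-batch size

# Toy per-image gradients for a "color/texture" feature: assumed to be
# nearly identical across images, so they point in almost the same direction.
color_grads = np.tile([1.0, 0.2], (n_images, 1)) \
    + 0.05 * rng.standard_normal((n_images, 2))

# Toy per-image gradients for a "shape" feature: assumed to be large but
# sparse and varied, pointing in a different direction for each image.
angles = rng.uniform(0, 2 * np.pi, n_images)
shape_grads = 3.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Mini-batch gradient descent applies the *average* of the per-image gradients.
print("mean per-image |color grad|:", np.linalg.norm(color_grads, axis=1).mean())
print("|batch-averaged color grad|:", np.linalg.norm(color_grads.mean(axis=0)))
print("mean per-image |shape grad|:", np.linalg.norm(shape_grads, axis=1).mean())
print("|batch-averaged shape grad|:", np.linalg.norm(shape_grads.mean(axis=0)))
# The averaged shape gradient is tiny despite large per-image gradients:
# averaging preserves the consistent signal and washes out the varied one.
```

Under these (illustrative) assumptions, the batch-averaged color gradient retains nearly its full per-image magnitude, whereas the batch-averaged shape gradient shrinks toward zero, which is the sense in which averaging over many images favors consistent features over sparser, more varied ones.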
Publisher
Cold Spring Harbor Laboratory