Abstract
The first steps of visual processing are often described as a bank of oriented filters followed by divisive normalization. This approach has been tremendously successful at predicting contrast thresholds in simple visual displays. However, it is unclear to what extent this kind of architecture also supports processing in more complex visual tasks performed on natural-looking images. We used a deep generative image model to embed arc segments with different curvatures in naturalistic images. These images contain the target as part of the image scene, resulting in considerable appearance variation of both target and background. Three observers localized arc targets in these images, achieving an average accuracy of 74.7% correct. The data were fit with several biologically inspired models, four standard deep convolutional neural networks (CNNs) from the computer vision literature, and a 5-layer CNN trained specifically for this task. Four models were particularly good at predicting observer responses: (i) a bank of oriented filters, similar to complex cells in primate area V1; (ii) a bank of oriented filters followed by tuned gain control, incorporating knowledge about cortical surround interactions; (iii) a bank of oriented filters followed by local normalization; and (iv) the 5-layer CNN trained specifically for this task. A control experiment with optimized stimuli based on these four models showed that the observers' data were best explained by model (ii), with tuned gain control. These data suggest that standard models of early vision provide good descriptions of performance in much more complex tasks than those they were designed for, while general-purpose nonlinear models such as convolutional neural networks do not.
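For readers unfamiliar with the architecture named in the opening sentence, the following is a minimal illustrative sketch of a bank of oriented filters followed by divisive normalization. It is not the implementation used in the study: the Gabor parameterization, the number of orientations, the untuned normalization pool (a plain sum over orientations at each location), the circular boundary handling, and all function and parameter names are assumptions made for illustration.

```python
import numpy as np


def gabor_pair(size, wavelength, theta, sigma):
    """Return even (cosine) and odd (sine) Gabor kernels for orientation theta.

    All parameters are illustrative assumptions, not values from the study.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Coordinate along the direction of luminance modulation.
    u = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    even = envelope * np.cos(2.0 * np.pi * u / wavelength)
    odd = envelope * np.sin(2.0 * np.pi * u / wavelength)
    even -= even.mean()  # crude removal of the DC response
    return even, odd


def circular_filter(image, kernel):
    """Circular convolution via the FFT (a boundary-handling simplification)."""
    padded = np.zeros_like(image, dtype=float)
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    # Shift the kernel so that its center sits at the array origin.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))


def oriented_energy(image, n_orientations=4, size=15, wavelength=6.0, sigma=3.0):
    """Complex-cell-like energy: squared even plus squared odd responses."""
    thetas = np.arange(n_orientations) * np.pi / n_orientations
    energy = []
    for theta in thetas:
        even, odd = gabor_pair(size, wavelength, theta, sigma)
        energy.append(circular_filter(image, even) ** 2
                      + circular_filter(image, odd) ** 2)
    return np.stack(energy), thetas


def divisive_normalization(energy, semisaturation=0.1):
    """Divide each channel by a constant plus the energy pooled over orientations."""
    pool = energy.sum(axis=0, keepdims=True)
    return energy / (semisaturation**2 + pool)
```

Applied to a grating whose luminance varies along the horizontal axis, the channel with `theta = 0` should dominate the normalized output, and the normalized responses at each pixel sum to less than one across orientations. A tuned gain control stage, as in model (ii), would replace the uniform pool with a weighted one; the weights are not specified here and are not attempted in this sketch.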
Publisher
Cold Spring Harbor Laboratory