Abstract
We hypothesize that high-level visual representations contain more than the representation of individual categories: they represent complex semantic information inherent in scenes that is most relevant for interaction with the world. Consequently, multimodal models such as Contrastive Language-Image Pre-training (CLIP), which construct image embeddings to best match embeddings of image captions, should better predict neural responses in visual cortex, since image captions typically contain the most semantically relevant information in an image for humans. We extracted image features using CLIP, which encodes visual concepts with supervision from natural language captions. We then used voxelwise encoding models based on CLIP features to predict brain responses to real-world images from the Natural Scenes Dataset. CLIP explains up to R² = 78% of the variance in stimulus-evoked responses from individual voxels in the held-out test data. CLIP also explains greater unique variance in higher-level visual areas compared to models trained only with image/label pairs (ImageNet-trained ResNet) or text (BERT). Visualizations of model embeddings and Principal Component Analysis (PCA) reveal that, with the use of captions, CLIP captures both global and fine-grained semantic dimensions represented within visual cortex. Based on these novel results, we suggest that humans' understanding of their environment forms an important dimension of visual representation.
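As a rough illustration of the voxelwise encoding approach described in the abstract, the sketch below maps CLIP image embeddings to voxel responses with ridge regression and scores per-voxel R² on held-out stimuli. It assumes the openai/CLIP package (https://github.com/openai/CLIP), PyTorch, and scikit-learn; the ViT-B/32 backbone and the synthetic images and responses are illustrative stand-ins, not the paper's actual pipeline or Natural Scenes Dataset data.

```python
# Minimal sketch (not the authors' code): voxelwise encoding from CLIP image
# features, with per-voxel R^2 evaluated on held-out images.
import clip
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone choice is an assumption

def clip_features(images):
    """Embed each stimulus image with CLIP's image encoder."""
    feats = []
    with torch.no_grad():
        for img in images:
            x = preprocess(img).unsqueeze(0).to(device)
            feats.append(model.encode_image(x).float().cpu().numpy().squeeze())
    return np.stack(feats)

# Synthetic stimuli and voxel responses as placeholders for NSD images/betas.
rng = np.random.default_rng(0)
images = [Image.fromarray(rng.integers(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(120)]
X = clip_features(images)             # (n_images, n_features) CLIP embeddings
Y = rng.standard_normal((120, 500))   # (n_images, n_voxels) responses

# Fit ridge regressions across voxels, then compute held-out R^2 per voxel.
X_train, X_test, Y_train, Y_test = X[:100], X[100:], Y[:100], Y[100:]
encoder = RidgeCV(alphas=np.logspace(-2, 5, 8)).fit(X_train, Y_train)
voxel_r2 = r2_score(Y_test, encoder.predict(X_test), multioutput="raw_values")
print(voxel_r2.max())
```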
Publisher
Cold Spring Harbor Laboratory
Cited by
4 articles.