A Multimodal Vision Transformer for Interpretable Fusion of Functional and Structural Neuroimaging Data

Published: 2023-07-18

Author:

Bi Yuda, Abrol Anees, Fu Zening, Calhoun Vince D.

Abstract

Deep learning models, despite their potential for increasing our understanding of intricate neuroimaging data, can be hampered by challenges related to interpretability. Multimodal neuroimaging appears to be a promising approach that allows us to extract supplementary information from various imaging modalities. Functional brain changes are often more pronounced in schizophrenia, albeit potentially less reproducible, while structural MRI effects are more replicable but usually smaller. Instead of analyzing each modality in isolation, a joint analysis of these data can bolster the effects and further refine our neurobiological understanding of schizophrenia. This paper introduces a novel deep learning model, the multimodal vision transformer (MultiViT), specifically engineered to improve the accuracy of schizophrenia classification by using structural MRI (sMRI) and functional MRI (fMRI) data independently while simultaneously leveraging the combined information from both modalities. This study uses functional network connectivity data derived from a fully automated independent component analysis method as the fMRI features and segmented gray matter volume (GMV) as the sMRI features. These offer sensitive, high-dimensional features for learning from structural and functional MRI data. The resulting MultiViT model is lightweight and robust, outperforming unimodal analyses. Our approach has been applied to data collected from control subjects and patients with schizophrenia, with the MultiViT model achieving an AUC of 0.833, which is significantly higher than the average AUC of 0.766 for unimodal baselines and 0.78 for multimodal baselines. Advanced algorithmic approaches for predicting and characterizing these disorders have steadily evolved, though subject and diagnostic heterogeneity poses significant challenges.
Because each modality provides only a partial representation of the brain, harnessing both modalities yields more comprehensive information than relying on either one alone. Furthermore, we conducted a saliency analysis to gain insight into the co-alterations in structural gray matter and functional network connectivity that are disrupted in schizophrenia. Although the MultiViT model clearly differs from previous multimodal methods, how it compares with methods such as MCCA and JICA is still under investigation and requires further research. The findings underscore the robustness and potential of interpretable multimodal data fusion models such as MultiViT for the classification and understanding of schizophrenia.
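
The abstract reports classification performance as AUC (area under the ROC curve). As a minimal sketch of what that metric measures, and not the authors' evaluation code, the snippet below computes AUC as the probability that a randomly chosen patient receives a higher score than a randomly chosen control, with ties counted as one half. The function name `roc_auc`, the label convention (1 = patient, 0 = control), and the toy scores are all assumptions made for illustration; they are not data from the study.

```python
def roc_auc(labels, scores):
    """Return the AUC as the fraction of (positive, negative) pairs
    in which the positive example gets the higher score.
    Tied pairs contribute 0.5. Labels are assumed to be 1 or 0."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the paper): 1 = patient, 0 = control.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of subjects, which is acceptable for illustration; a rank-based implementation would be the usual choice for larger cohorts.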

Publisher

Cold Spring Harbor Laboratory
