Multimodal Biomedical Data Fusion Using Sparse Canonical Correlation Analysis and Cooperative Learning: A Cohort Study on COVID-19

Published: 2023-11-20

Author:

Er Ahmet Gorkem¹, Ding Daisy Yi¹, Er Berrin², Uzun Mertcan², Cakmak Mehmet², Sadée Christoph¹, Durhan Gamze², Ozmen Mustafa Nasuh², Tanriover Mine Durusu², Topeli Arzu², Son Yesim Aydin³, Tibshirani Robert¹, Unal Serhat², Gevaert Olivier¹

Affiliation:

1. Stanford University

2. Hacettepe University

3. Middle East Technical University

Abstract

Through technological innovations, patient cohorts can be examined from multiple views with high-dimensional, multiscale biomedical data to classify clinical phenotypes and predict outcomes. Here, we present our approach for analyzing multimodal data using unsupervised and supervised sparse linear methods in a COVID-19 patient cohort. This prospective cohort study of 149 adult patients was conducted in a tertiary care academic center. First, we used sparse canonical correlation analysis (CCA) to identify and quantify relationships across different data modalities, including viral genome sequencing, imaging, clinical data, and laboratory results. Then, we used cooperative learning to predict the clinical outcome of COVID-19 patients. We show that serum biomarkers representing severe disease and acute phase response correlate with original and wavelet radiomics features in the LLL frequency channel ($\mathrm{corr}(Xu_1, Zv_1) = 0.596$, p-value < 0.001). Among radiomics features, histogram-based first-order features reporting skewness, kurtosis, and uniformity have the most negative coefficients, whereas entropy-related features have the highest positive coefficients. Moreover, unsupervised analysis of clinical data and laboratory results gives insights into distinct clinical phenotypes. Leveraging the availability of global viral genome databases, we demonstrate that the Word2Vec natural language processing model can be used for viral genome encoding. It not only separates major SARS-CoV-2 variants but also preserves the phylogenetic relationships among them. Our quadruple model using Word2Vec encoding achieves better prediction results in the supervised task, yielding an area under the curve (AUC) of 0.87 and an accuracy of 0.77. Our study illustrates that sparse CCA and cooperative learning are powerful techniques for handling high-dimensional, multimodal data to investigate multivariate associations in unsupervised and supervised tasks.
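
To illustrate how a first pair of sparse canonical vectors such as u₁ and v₁ can be obtained for two data views X and Z, the following is a minimal sketch. It assumes a simplified variant of the penalized matrix decomposition approach to sparse CCA, with fixed soft-thresholds in place of a tuned L1 constraint. It is not the authors' implementation; the function name `sparse_cca` and the parameters `lam_u`, `lam_v`, `n_iter`, and `tol` are hypothetical and chosen only for illustration.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry of `a` toward zero by `lam` (lasso-style operator)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def unit_norm(a):
    """Scale `a` to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a


def sparse_cca(X, Z, lam_u=0.1, lam_v=0.1, n_iter=100, tol=1e-8):
    """First pair of sparse canonical vectors for views X (n x p) and Z (n x q).

    Simplified sketch: alternating soft-thresholded power iterations on the
    cross-covariance matrix, with fixed thresholds (an assumption made here).
    Assumes no column of X or Z is constant.
    """
    # Standardize each feature so the cross-product is a correlation matrix.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    C = X.T @ Z / X.shape[0]

    # Initialize v with the leading right singular vector of C.
    _, _, Vt = np.linalg.svd(C, full_matrices=False)
    v = Vt[0]
    u = unit_norm(soft_threshold(C @ v, lam_u))

    for _ in range(n_iter):
        v_new = unit_norm(soft_threshold(C.T @ u, lam_v))
        u = unit_norm(soft_threshold(C @ v_new, lam_u))
        converged = np.linalg.norm(v_new - v) < tol
        v = v_new
        if converged:
            break

    # Correlation between the two canonical variates, corr(Xu, Zv).
    corr = np.corrcoef(X @ u, Z @ v)[0, 1]
    return u, v, corr
```

Calling `sparse_cca(X, Z)` on two matrices with matched rows (one row per patient) returns the weight vectors and the correlation of the resulting canonical variates. Larger thresholds zero out more features; if they are set too high, both vectors collapse to zero and the correlation is undefined.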

Publisher

Research Square Platform LLC
