Author:
Li Qing, Yu Yang, Kossinna Pathum, Lun Theodore, Liao Wenyuan, Zhang Qingrun
Abstract
Machine learning models are frequently used in transcriptome analyses. In particular, Representation Learning (RL) methods, e.g., autoencoders, are effective at learning critical representations from noisy data. However, learned representations, e.g., the "latent variables" of an autoencoder, are difficult to interpret, let alone to use for prioritizing essential genes for functional follow-up. In contrast, traditional analyses identify important genes such as Differentially Expressed (DiffEx) genes, Differentially Co-Expressed (DiffCoEx) genes, and Hub genes. Intuitively, complex gene-gene interactions may be beyond the reach of marginal effects (DiffEx) or correlations (DiffCoEx and Hub), indicating the need for powerful RL models. However, the lack of interpretability and of individual target genes is an obstacle to RL's broad use in practice. To enable interpretable analysis and gene identification using RL, we propose "Critical genes", defined as genes that contribute highly to learned representations (e.g., latent variables in an autoencoder). As a proof of concept, supported by eXplainable Artificial Intelligence (XAI), we implemented an eXplainable Autoencoder for Critical genes (XA4C) that quantifies each gene's contribution to the latent variables, based on which Critical genes are prioritized. Applying XA4C to gene expression data from six cancers showed that Critical genes capture essential pathways underlying cancers. Remarkably, Critical genes have little overlap with Hub or DiffEx genes, yet show higher enrichment in a comprehensive disease gene database (DisGeNET), evidencing their potential to disclose substantial unknown biology. As an example, we discovered five Critical genes sitting at the center of the Lysine degradation (hsa00310) pathway that display distinct interaction patterns in tumor and normal tissues. In conclusion, XA4C facilitates explainable analysis using RL, and the Critical genes discovered by explainable RL empower the study of complex interactions.
Author Summary
We propose a gene expression data analysis tool, XA4C, which builds an eXplainable Autoencoder to reveal Critical genes. XA4C disentangles the black box of the autoencoder's neural network by providing each gene's contribution to the autoencoder's latent variables. A gene's contribution to the latent variables is then used to define its importance, based on which XA4C prioritizes "Critical genes". Notably, we found that Critical genes have two properties: (1) their overlap with traditional differentially expressed genes and hub genes is poor, suggesting that they bring novel insights into transcriptome data that traditional analyses cannot capture; and (2) their enrichment in a comprehensive disease gene database (DisGeNET) is higher than that of differentially expressed or hub genes, evidencing their strong relevance to disease pathology. We therefore conclude that XA4C reveals an additional landscape of gene expression data.
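The abstract does not detail XA4C's attribution machinery beyond "supported by XAI", so the following is a minimal, hypothetical sketch of the general idea only: train an autoencoder on a samples-by-genes expression matrix, then score each gene by a gradient-x-input attribution of the encoder's latent variables (a simple stand-in for SHAP-style XAI attributions; all names and dimensions here, e.g., ExpressionAE, N_GENES, are illustrative, not from the paper).

```python
# Illustrative sketch only -- NOT XA4C's actual implementation.
import torch
import torch.nn as nn

N_GENES, N_LATENT = 1000, 32  # hypothetical dimensions

class ExpressionAE(nn.Module):
    """Toy autoencoder over a (samples x genes) expression matrix."""
    def __init__(self, n_genes: int, n_latent: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 256), nn.ReLU(),
            nn.Linear(256, n_latent))
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 256), nn.ReLU(),
            nn.Linear(256, n_genes))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def gene_contributions(model: ExpressionAE, x: torch.Tensor) -> torch.Tensor:
    """Score each gene by mean |gradient x input| of each latent variable,
    summed over latent dimensions; higher = larger contribution."""
    x = x.detach().clone().requires_grad_(True)
    z = model.encoder(x)                         # (n_samples, n_latent)
    scores = torch.zeros(x.shape[1])
    for k in range(z.shape[1]):                  # one backward pass per latent dim
        (grad,) = torch.autograd.grad(z[:, k].sum(), x, retain_graph=True)
        scores += (grad * x.detach()).abs().mean(dim=0)
    return scores

model = ExpressionAE(N_GENES, N_LATENT)          # training loop omitted
expr = torch.rand(128, N_GENES)                  # placeholder expression data
ranking = torch.argsort(gene_contributions(model, expr), descending=True)
# Top-ranked genes play the role of "Critical genes" in the paper's terminology.
```

Similarly, the DisGeNET enrichment comparison described in the abstract can be illustrated, under assumptions, with a Fisher's exact test on the 2x2 overlap table between a prioritized gene set and known disease genes; the function below is a generic sketch, not the authors' analysis code.

```python
# Generic gene-set enrichment test (Fisher's exact, one-sided).
from scipy.stats import fisher_exact

def enrichment(gene_set: set, disease_genes: set, background: set):
    a = len(gene_set & disease_genes)                 # prioritized and disease
    b = len(gene_set - disease_genes)                 # prioritized, not disease
    c = len((background - gene_set) & disease_genes)  # disease, not prioritized
    d = len(background - gene_set - disease_genes)    # neither
    odds, p = fisher_exact([[a, b], [c, d]], alternative="greater")
    return odds, p
```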
Publisher
Cold Spring Harbor Laboratory