Abstract
Representation learning for tumor gene expression (GEx) data with deep neural networks is limited by the large gene feature space and the scarcity of available clinical and preclinical data. The translation of the learned representation between these data sources is further hindered by inherent molecular differences. To address these challenges, we propose GExMix (GeneExpressionMixup), a data augmentation method that extends the Mixup concept to generate training samples while accounting for the imbalance in both data classes and data sources. We leverage the GExMix-augmented training set in encoder-decoder models to learn a GEx latent representation. Subsequently, we combine the learned representation with drug chemical features in a dual-objective, gene-centric drug response prediction model that jointly reconstructs the GEx latent embeddings and classifies drug response. This dual-objective design prioritizes gene-centric information to enhance the final drug response prediction. We demonstrate that augmenting training samples improves the GEx representation, which in turn benefits the gene-centric drug response prediction model. Our findings underscore the effectiveness of GExMix in enriching GEx data for deep neural networks. Moreover, the proposed gene-centricity further improves drug response prediction when translating from preclinical to clinical datasets. This highlights the potential of the proposed framework for GEx data analysis, paving the way toward precision medicine.
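To make the augmentation idea concrete, the following is a minimal sketch of standard Mixup applied to expression vectors, with one possible way of handling imbalance. It is not the GExMix algorithm itself, which the abstract does not specify. The function names `mixup_pair` and `balanced_mixup`, the `alpha` value, and the choice to draw (class, source) groups uniformly are all illustrative assumptions.

```python
import random


def mixup_pair(x_a, x_b, y_a, y_b, alpha=0.2, rng=random):
    """Standard Mixup: convex combination of two samples and their one-hot labels."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


def balanced_mixup(samples, n_new, alpha=0.2, seed=0):
    """Generate n_new mixed samples.

    samples: list of (expression_vector, one_hot_label, group), where group is a
    hypothetical (class, source) key such as ("responder", "preclinical").
    ASSUMPTION: groups are drawn uniformly, so rare classes or sources are
    picked as often as common ones. The paper's actual scheme may differ.
    """
    rng = random.Random(seed)
    by_group = {}
    for x, y, group in samples:
        by_group.setdefault(group, []).append((x, y))
    groups = sorted(by_group)

    augmented = []
    for _ in range(n_new):
        # Pick the group first, then a sample within it.
        x_a, y_a = rng.choice(by_group[rng.choice(groups)])
        x_b, y_b = rng.choice(by_group[rng.choice(groups)])
        x_mix, y_mix, _ = mixup_pair(x_a, x_b, y_a, y_b, alpha, rng)
        augmented.append((x_mix, y_mix))
    return augmented
```

Sampling the group before the sample is one simple way to counter imbalance across both classes and data sources; the mixed labels remain valid soft labels because each is a convex combination of one-hot vectors.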
Publisher
Cold Spring Harbor Laboratory