Abstract
Predicting breast cancer prognosis helps improve the treatment and management of the disease. In recent decades, many prediction models for breast cancer prognosis have been developed based on transcriptomic data. A common assumption made by these models is that the test and training data follow the same distribution. However, in practice, due to the heterogeneity of breast cancer and the different environments (e.g. hospitals) where data are collected, the distribution of the test data may shift from that of the training data. For example, new patients are likely to have a different distribution of breast cancer stages from those in the training dataset. Thus, existing methods may not provide stable prediction performance for breast cancer prognosis when the data distribution shifts. In this paper, we present a novel stable prediction method for reliable breast cancer prognosis under data distribution shift. Our model, known as Deep Global Balancing Cox regression (DGBCox), is based on causal inference theory. In DGBCox, high-dimensional gene expression data are first transformed into latent network-based representations by a deep auto-encoder neural network. Then, after balancing the latent representations using a proposed causality-based approach, causal latent features are selected for breast cancer prognosis. According to causal inference theory, causal features have persistent relationships with survival outcomes even under distribution shift across different environments. Therefore, the proposed DGBCox method is robust and stable for breast cancer prognosis. We apply DGBCox to 12 test datasets from different breast cancer studies. The results show that DGBCox outperforms benchmark methods in terms of both prediction accuracy and stability. We also propose a permutation importance algorithm to rank the genes in the DGBCox model.
The top 50 ranked genes suggest that the cell cycle and organelle organisation could be the most relevant biological processes for stable breast cancer prognosis.
Author summary
Various prediction models have been proposed for breast cancer prognosis. These models are usually trained on one dataset and then used to predict the survival outcomes of patients in new test datasets. Most of them share the assumption that the test and training data follow the same distribution. However, as breast cancer is a heterogeneous disease, this assumption may be violated in practice. In this study, we propose a novel method for reliable breast cancer prognosis when the test data distribution shifts from that of the training data. The proposed model was trained on one dataset and applied to twelve test datasets from different breast cancer studies. Compared with benchmark methods for breast cancer prognosis, our model shows better prediction accuracy and stability. The top 50 important genes in our model provide clues to the relationship between several biological mechanisms and clinical outcomes of breast cancer. Our proposed method can potentially be adapted to other cancer types.
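The abstract mentions a permutation importance algorithm for ranking genes but does not spell it out. As a rough illustration of the general idea only, and not the authors' algorithm, the sketch below shuffles one gene at a time and measures how much a survival model's concordance index drops. The choice of Harrell's concordance index as the score, the function names `concordance_index` and `permutation_importance`, and the toy risk model are all assumptions made for this example.

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's C-index: share of comparable pairs that the risk score orders correctly."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a pair is comparable only if the earlier time is an observed event
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # higher risk, shorter survival: concordant
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied risk scores count as half
    return concordant / comparable

def permutation_importance(predict_risk, X, time, event, n_repeats=10, seed=0):
    """Mean drop in C-index when each column (gene) of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = concordance_index(time, event, predict_risk(X))
    importance = np.zeros(X.shape[1])
    for g in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, g] = rng.permutation(X_perm[:, g])  # break the gene-outcome link
            drops.append(baseline - concordance_index(time, event, predict_risk(X_perm)))
        importance[g] = np.mean(drops)
    return importance

# Toy data (hypothetical): 60 patients, 3 genes, only gene 0 drives the risk.
def toy_risk(X):
    return 2.0 * X[:, 0]

data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(60, 3))
time = np.exp(-toy_risk(X))           # higher risk means shorter survival time
event = np.ones(60, dtype=bool)       # no censoring in this toy example

scores = permutation_importance(toy_risk, X, time, event)
ranking = np.argsort(-scores)         # genes ordered from most to least important
```

In this toy setting the stand-in model ignores genes 1 and 2, so shuffling them leaves the score unchanged and only gene 0 receives a positive importance. With a trained prognosis model, `predict_risk` would be replaced by that model's risk prediction function.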
Publisher
Cold Spring Harbor Laboratory