Affiliation:
1. Department of Computational Biology Cornell University Ithaca New York USA
2. Department of Statistics University of Missouri Columbia Missouri USA
3. Department of Surgery Center for Surgery and Public Health, Brigham and Women's Hospital, Harvard Medical School Boston Massachusetts USA
Abstract
AbstractWith the increasing availability of biomedical data from multiple platforms of the same patients in clinical research, such as epigenomics, gene expression, and clinical features, there is a growing need for statistical methods that can jointly analyze data from different platforms to provide complementary information for clinical studies. In this paper, we propose a two‐stage hierarchical Bayesian model that integrates high‐dimensional biomedical data from diverse platforms to select biomarkers associated with clinical outcomes of interest. In the first stage, we use Expectation Maximization‐based approach to learn the regulating mechanism between epigenomics (e.g., gene methylation) and gene expression while considering functional gene annotations. In the second stage, we group genes based on the regulating mechanism learned in the first stage. Then, we apply a group‐wise penalty to select genes significantly associated with clinical outcomes while incorporating clinical features. Simulation studies suggest that our model‐based data integration method shows lower false positives in selecting predictive variables compared with existing method. Moreover, real data analysis based on a glioblastoma (GBM) dataset reveals our method's potential to detect genes associated with GBM survival with higher accuracy than the existing method. Moreover, most of the selected biomarkers are crucial in GBM prognosis as confirmed by existing literature.