Affiliation:
1. Department of Statistics, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA
Abstract
Analyzing correlated high‐dimensional data is a challenging problem in genomics, proteomics, and other related areas. For example, it is important to identify significant genetic pathway effects associated with biomarkers, where a gene pathway is a set of genes that work together functionally to regulate a certain biological process. A pathway‐based analysis can detect a subtle change in expression level that cannot be found using a gene‐based analysis. Here, we refer to a pathway as a set and a gene as an element of a set. However, when there are multiple pathways, it is challenging to automatically select those that are highly associated with the outcome. In this paper, we propose a semiparametric multikernel regression model to study the effects of fixed covariates (e.g., clinical variables) and sets of elements (e.g., pathways of genes), addressing the problem of detecting signal sets associated with biomarkers. We model the unknown high‐dimensional functions of multiple sets via multiple Gaussian kernel machines to allow for the possibility that elements within the same set interact with each other. Hence, our variable set selection can be considered a Gaussian process set selection. We develop our Gaussian process set selection under the Bayesian variance component‐selection framework. We incorporate prior knowledge of structural sets by imposing an Ising prior on the model. Our approach can be easily applied in high‐dimensional spaces where the sample size is smaller than the number of variables. An efficient variational Bayes algorithm is developed. We demonstrate the advantages of our approach through simulation studies and through a type II diabetes genetic‐pathway analysis.