Abstract
Large-scale genome-wide association studies (GWAS) have successfully identified a wide range of genetic variants underlying complex diseases. Network-based penalized regression has been developed to overcome the computational challenges of analyzing high-dimensional genomic data by incorporating a biological genetic network. In this paper, we propose a gene selection approach that incorporates genetic networks into case-control association studies for DNA sequence data or DNA methylation data. Instead of using traditional dimension reduction techniques such as principal component analysis and supervised principal component analysis, we use a linear combination of genotypes at SNPs or methylation values at CpG sites in each gene to capture gene-level signals. We develop three approaches for the linear combination: optimally weighted sum (OWS), LD-adjusted polygenic risk score (LD-PRS), and beta-based weighted sum (BWS). OWS and LD-PRS are supervised approaches that depend on the effect of each SNP or CpG site on the case-control status, while BWS can be computed without using the case-control status. After using one of these linear combinations of genotypes or methylation values in each gene to capture gene-level signals, we regularize them to perform gene selection based on the biological network. Simulation studies show that the proposed approaches have higher true positive rates than traditional dimension reduction techniques. We also apply our approaches to DNA methylation data and UK Biobank DNA sequence data to analyze rheumatoid arthritis. The results show that the proposed methods can select genes potentially related to rheumatoid arthritis that are missed by existing methods.
Author Summary
There is strong evidence that when genes are functionally related to each other in a genetic network, statistical methods that use prior biological network knowledge can outperform methods that ignore genetic network structure. Statistical methods that incorporate genetic network information into human genetic association analysis have therefore been widely used since 2008. Here, we take advantage of recently developed methods to capture gene-level signals in network-based penalized regression of high-dimensional genetic data. We show that our proposed methods outperform three traditional principal component-based dimension reduction techniques in several simulation scenarios in terms of true positive rates. In addition, when we apply our methods to both DNA methylation data and DNA sequence data, the genes they identify, such as HLA-DMA, HLA-DPB1, and HLA-DQA2 in the HLA region, are significantly enriched in the rheumatoid arthritis pathway.
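The two-step idea described above (collapse the SNPs or CpG sites in each gene into one weighted sum, then penalize the gene-level coefficients using a network) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names are hypothetical, the case-control mean difference is only a simple stand-in for a supervised weight (it is not the OWS, LD-PRS, or BWS definition), and the normalized graph Laplacian penalty is one common form of network penalty that is assumed here for illustration.

```python
import numpy as np


def gene_level_scores(X, gene_of_feature, weights):
    """Collapse feature-level data into one weighted sum per gene.

    X               : (n_samples, n_features) genotypes or methylation values
    gene_of_feature : (n_features,) integer gene index of each SNP / CpG site
    weights         : (n_features,) weight of each feature in its gene's sum
    Returns an (n_samples, n_genes) matrix of gene-level scores.
    """
    X = np.asarray(X, dtype=float)
    gene_of_feature = np.asarray(gene_of_feature)
    weights = np.asarray(weights, dtype=float)
    n_genes = int(gene_of_feature.max()) + 1
    scores = np.zeros((X.shape[0], n_genes))
    for g in range(n_genes):
        cols = gene_of_feature == g  # features that belong to gene g
        scores[:, g] = X[:, cols] @ weights[cols]
    return scores


def mean_difference_weights(X, y):
    """Illustrative supervised weights (an assumption, not OWS or LD-PRS):
    per-feature mean in cases (y == 1) minus mean in controls (y == 0)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    return X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)


def laplacian_penalty(beta, A):
    """Quadratic network penalty beta' L beta, where L is the normalized
    graph Laplacian of the symmetric gene adjacency matrix A.
    Isolated genes (degree 0) contribute nothing to the penalty."""
    beta = np.asarray(beta, dtype=float)
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)  # degree of each gene in the network
    pos = d > 0
    inv_sqrt = np.zeros_like(d)
    inv_sqrt[pos] = 1.0 / np.sqrt(d[pos])
    L = np.diag(pos.astype(float)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    return float(beta @ L @ beta)
```

In this sketch, `gene_level_scores` would produce the design matrix for a penalized logistic regression, and `laplacian_penalty` would be added to the loss so that coefficients of linked genes are encouraged to be similar. The penalty is small when connected genes have similar degree-scaled coefficients and grows when they differ.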
Publisher
Cold Spring Harbor Laboratory