Affiliations:
1. Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, South Carolina, USA
2. Data and Statistical Sciences, AbbVie Inc., North Chicago, Illinois, USA
3. Department of Surgery, College of Medicine, University of Florida, Gainesville, Florida, USA
4. Department of Biostatistics, College of Public Health and Health Promotion & College of Medicine, University of Florida, Gainesville, Florida, USA
Abstract
Copy number variants (CNVs) are prevalent in the human genome and have a profound effect on genomic organization and human disease. Discovering disease‐associated CNVs is critical for understanding disease pathogenesis and for aiding diagnosis and treatment. However, traditional methods for assessing the association between CNVs and disease risk adopt a two‐stage strategy in which CNVs are first measured quantitatively and then tested for association. This strategy can bias the association estimates and reduce statistical power, which is a major barrier to routine genome‐wide assessment of such variation. In this article, we developed One‐Stage CNV–disease Association Analysis (OSCAA), a flexible algorithm for discovering disease‐associated CNVs for both quantitative and qualitative traits. OSCAA employs a two‐dimensional Gaussian mixture model built on the principal components (PCs) of copy number intensities, accounting for technical biases in CNV detection while simultaneously testing for CNV effects on outcome traits. In OSCAA, CNVs are identified and their associations with disease risk are evaluated in a single step, so that the uncertainty of CNV identification is carried into the statistical model. Our simulations demonstrated that OSCAA outperformed the existing one‐stage method and traditional two‐stage methods by yielding more accurate estimates of the CNV–disease association, especially for short CNVs or CNVs with weak signals. In conclusion, OSCAA is a powerful and flexible approach for CNV association testing with high sensitivity and specificity, and it can be readily applied to different traits and to clinical risk prediction.
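To make the one-stage idea concrete, the following is a minimal sketch and not the OSCAA implementation. It assumes a quantitative trait, a latent copy number state per sample, two intensity PCs that follow a state-specific two-dimensional Gaussian, and a trait mean that is linear in copy number. The function names (`fit_one_stage`, `simulate`), the initialization along the first PC, and all simulation settings are hypothetical choices made for illustration. An EM loop updates the copy number probabilities and the association parameters together, so uncertain CNV calls are weighted rather than fixed in advance.

```python
import numpy as np

def gaussian_pdf_2d(x, mean, cov):
    """Density of a 2-D Gaussian evaluated at each row of x."""
    diff = x - mean
    quad = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))

def normal_pdf(y, mean, var):
    """Density of a univariate normal evaluated at each element of y."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def fit_one_stage(x, y, copy_numbers, n_iter=200):
    """EM for a joint model of intensity PCs x (n, 2) and trait y (n,).

    Assumed model (illustrative only): state j has copy number c[j],
    x | j ~ N(mu[j], cov[j]) and y | j ~ N(alpha + beta * c[j], sigma2).
    """
    c = np.asarray(copy_numbers, dtype=float)
    n, k = x.shape[0], len(c)
    # Assumption: the first PC increases with copy number, so initial
    # means are spread evenly along it.
    pc1 = np.linspace(x[:, 0].min(), x[:, 0].max(), k)
    mu = np.column_stack([pc1, np.full(k, np.median(x[:, 1]))])
    cov = np.repeat(np.cov(x, rowvar=False)[None, :, :], k, axis=0)
    pi = np.full(k, 1.0 / k)
    alpha, beta, sigma2 = y.mean(), 0.0, y.var()

    for _ in range(n_iter):
        # E-step: posterior state probabilities use intensities AND trait.
        dens = np.column_stack([
            pi[j] * gaussian_pdf_2d(x, mu[j], cov[j])
            * normal_pdf(y, alpha + beta * c[j], sigma2)
            for j in range(k)
        ])
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)

        # M-step: mixture parameters for the intensity PCs.
        weight = resp.sum(axis=0) + 1e-12
        pi = weight / n
        for j in range(k):
            mu[j] = resp[:, j] @ x / weight[j]
            diff = x - mu[j]
            cov[j] = (resp[:, j, None] * diff).T @ diff / weight[j]
            cov[j] += 1e-6 * np.eye(2)  # small ridge for numerical stability

        # M-step: weighted least squares for the trait model.
        c_bar, c2_bar = resp @ c, resp @ c ** 2
        lhs = np.array([[n, c_bar.sum()], [c_bar.sum(), c2_bar.sum()]])
        rhs = np.array([y.sum(), (y * c_bar).sum()])
        alpha, beta = np.linalg.lstsq(lhs, rhs, rcond=None)[0]
        resid2 = resp * (y[:, None] - alpha - beta * c[None, :]) ** 2
        sigma2 = resid2.sum() / n

    return {"alpha": alpha, "beta": beta, "sigma2": sigma2,
            "pi": pi, "mu": mu, "cov": cov, "resp": resp}

def simulate(n=600, beta=0.5, seed=1):
    """Toy data: three copy number states separated along the first PC."""
    rng = np.random.default_rng(seed)
    c = rng.choice([1.0, 2.0, 3.0], size=n, p=[0.2, 0.6, 0.2])
    x = np.column_stack([3.0 * (c - 2.0), np.zeros(n)])
    x = x + rng.normal(scale=0.5, size=(n, 2))
    y = 1.0 + beta * c + rng.normal(scale=0.5, size=n)
    return x, y, c
```

Calling `fit_one_stage(*simulate()[:2], [1.0, 2.0, 3.0])` returns the estimated effect `beta` together with the posterior state probabilities `resp`. A two-stage analysis would instead take the most likely state from an intensity-only mixture and regress the trait on that hard call, which discards the calling uncertainty that the joint E-step retains.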