SNP Variable Selection by Generalized Graph Domination

Published: 2018-08-20

Author:

Sun Shuzhen, Miao Zhuqi, Ratcliffe Blaise, Campbell Polly, Pasch Bret, El-Kassaby Yousry A., Balasundaram Balabhaskar, Chen Charles

Abstract

High-throughput sequencing technology has revolutionized both medical and biological research by generating exceedingly large numbers of genetic variants. The resulting datasets share a number of characteristics that can lead to poor generalization capacity. Concerns include noise accumulated due to the large number of predictors, sparse information arising from the p ≫ n problem, and overfitting and model mis-identification resulting from spurious collinearity. Additionally, complex correlation patterns are present among variables. As a consequence, reliable variable selection techniques play a pivotal role in predictive analysis, generalization capability, and robustness in clustering, as well as in the interpretability of the derived models.

The k-dominating set, a parameterized graph-theoretic generalization model, was used to model SNP (single nucleotide polymorphism) data as a similarity network and to search for representative SNP variables. In particular, each SNP was represented as a vertex in the graph, and (dis)similarity measures such as correlation coefficients or pairwise linkage disequilibrium were estimated to describe the relationship between each pair of SNPs; two vertices are adjacent, i.e. joined by an edge, if their pairwise similarity measure exceeds a user-specified threshold. A minimum k-dominating set in the SNP graph was then identified as the smallest subset of SNPs such that every SNP excluded from the subset has at least k neighbors among the selected ones. The strength of k-dominating set selection in identifying independent variables, and in culling representative variables that are highly correlated with others, was demonstrated on a simulated dataset. The advantages of k-dominating set variable selection were also illustrated in two applications: pedigree reconstruction using SNP profiles of 1,372 Douglas-fir trees, and species delineation for 226 grasshopper mouse samples.
C++ source code implementing SNP-SELECT, which uses the Gurobi™ optimization solver for k-dominating set variable selection, is available at https://github.com/transgenomicsosu/SNP-SELECT.
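
The graph construction and selection rule described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the SNP-SELECT implementation: it uses squared Pearson correlation as the similarity measure and a simple greedy heuristic in place of an exact solver, so the set it returns is a valid k-dominating set but is not guaranteed to be a minimum one. The function names (`r_squared`, `build_snp_graph`, `greedy_k_dominating_set`, `is_k_dominating`), the toy genotype vectors, and the threshold of 0.8 are all assumptions made for this example.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # monomorphic SNP: correlation is undefined
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def build_snp_graph(genotypes, threshold):
    """One vertex per SNP; an edge when pairwise similarity exceeds the threshold."""
    adj = {snp: set() for snp in genotypes}
    for a, b in combinations(genotypes, 2):
        if r_squared(genotypes[a], genotypes[b]) > threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def greedy_k_dominating_set(adj, k):
    """Greedy heuristic (assumption: not the exact method used in the paper)."""
    need = {v: k for v in adj}  # selected neighbors each vertex still requires
    selected = set()

    def gain(u):
        # Selecting u cancels its own requirement and helps each unmet neighbor.
        helped = sum(1 for w in adj[u] if w not in selected and need[w] > 0)
        return need[u] + helped

    while any(need[v] > 0 for v in adj if v not in selected):
        candidates = sorted(v for v in adj if v not in selected)
        best = max(candidates, key=gain)  # ties broken by name order
        selected.add(best)
        need[best] = 0
        for w in adj[best]:
            if w not in selected and need[w] > 0:
                need[w] -= 1
    return selected


def is_k_dominating(adj, selected, k):
    """Check that every excluded SNP has at least k selected neighbors."""
    return all(len(adj[v] & selected) >= k for v in adj if v not in selected)


# Toy data: allele counts (0, 1, 2) for six samples; values are made up.
genotypes = {
    "snpA": [0, 1, 2, 0, 1, 2],
    "snpB": [0, 1, 2, 0, 1, 2],  # identical to snpA
    "snpC": [2, 1, 0, 2, 1, 0],  # perfectly anti-correlated with snpA
    "snpD": [1, 0, 1, 1, 0, 1],  # uncorrelated with the others
}
adj = build_snp_graph(genotypes, threshold=0.8)
selected = greedy_k_dominating_set(adj, k=1)
print(sorted(selected))  # ['snpA', 'snpD']
```

In this toy graph the three mutually correlated SNPs collapse to one representative, while the uncorrelated SNP has no neighbors and must be kept. Raising k forces more redundancy: with k = 2, each excluded SNP needs two selected neighbors, so more of the correlated group is retained.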

Publisher

Cold Spring Harbor Laboratory
