Abstract
Genotype-to-phenotype prediction is a central problem in human genetics. In recent years, it has become possible to construct complex predictive models for phenotypes, thanks to the availability of large genome data sets as well as efficient and scalable machine learning tools. In this paper, we make a threefold contribution to this problem. First, we ask if state-of-the-art nonlinear predictive models, such as boosted decision trees, can be more efficient for phenotype prediction than conventional linear models. We find that this is indeed the case if model features include a sufficiently rich set of covariates, but probably not otherwise. Second, we ask if the conventional selection of single nucleotide polymorphisms (SNPs) by genome-wide association studies (GWAS) can be replaced by a more efficient procedure that takes into account the information in previously selected SNPs. We propose such a procedure, based on sequential feature importance estimation with decision trees, and show that this approach indeed produces informative SNP sets that are much more compact than those selected with GWAS. Finally, we show that the highest prediction accuracy can ultimately be achieved by ensembling individual linear and nonlinear models. To the best of our knowledge, for some of the phenotypes that we consider (asthma, hypothyroidism), our results set a new state of the art.
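The idea of selecting SNPs sequentially, so that each new SNP is scored against what the previously selected ones already explain, can be illustrated with a small sketch. This is not the procedure from the paper, whose details the abstract does not give. It is a minimal toy version under stated assumptions: genotypes are coded 0/1/2, the phenotype is continuous, each "tree" is a single-SNP fit with one leaf per genotype value, and the names `group_fit`, `sequential_select` and `marginal_top` are hypothetical. The marginal ranking stands in for a GWAS-style one-SNP-at-a-time selection.

```python
from itertools import product


def group_fit(column, residual):
    """Fit a tiny single-SNP tree: one leaf per genotype value (0, 1, 2).

    Returns (gain, leaf_means), where gain is the reduction in squared error
    compared with predicting the overall mean of `residual`.
    """
    sums, counts = {}, {}
    for g, r in zip(column, residual):
        sums[g] = sums.get(g, 0.0) + r
        counts[g] = counts.get(g, 0) + 1
    total = sum(residual)
    gain = sum(s * s / counts[g] for g, s in sums.items()) - total * total / len(residual)
    means = {g: sums[g] / counts[g] for g in sums}
    return gain, means


def sequential_select(columns, y, k):
    """Greedy selection (assumed scheme): score each SNP on the current residual."""
    residual = list(y)
    selected = []
    for _ in range(k):
        best_j, best_gain, best_means = None, 0.0, None
        for j, col in enumerate(columns):
            if j in selected:
                continue
            gain, means = group_fit(col, residual)
            if gain > best_gain:  # strict: ties keep the earlier SNP
                best_j, best_gain, best_means = j, gain, means
        if best_j is None:  # nothing left that explains the residual
            break
        selected.append(best_j)
        # Remove what the chosen SNP explains before scoring the next one.
        residual = [r - best_means[g] for g, r in zip(columns[best_j], residual)]
    return selected


def marginal_top(columns, y, k):
    """GWAS-like baseline (an analogy only): rank SNPs one at a time on y."""
    gains = [group_fit(col, y)[0] for col in columns]
    return sorted(range(len(columns)), key=lambda j: -gains[j])[:k]


# Toy data: a balanced design over three independent genotypes.
rows = list(product((0, 1, 2), repeat=3))
snp0 = [a for a, b, c in rows]   # strong causal SNP
snp1 = list(snp0)                # exact copy of snp0 (perfect linkage)
snp2 = [b for a, b, c in rows]   # weaker, independent causal SNP
snp3 = [c for a, b, c in rows]   # unrelated SNP
columns = [snp0, snp1, snp2, snp3]
y = [2.0 * a + b for a, b, c in rows]

print(sequential_select(columns, y, 2))  # [0, 2]
print(marginal_top(columns, y, 2))       # [0, 1]
```

In this toy setting the marginal ranking spends its second slot on the duplicate of the first SNP, while the sequential scheme skips it because the duplicate explains nothing in the residual. That is one way a sequentially selected set can be more compact at the same information content.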
Funder
Russian Science Foundation
Publisher
Public Library of Science (PLoS)
Cited by
8 articles.