Machine learning as an effective method for identifying true SNPs in polyploid plants

Author:

Korani Walid, Clevenger Josh P., Chu Ye, Ozias-Akins Peggy

Abstract

Single nucleotide polymorphisms (SNPs) have many advantages as molecular markers because they are ubiquitous and co-dominant. However, discovering true SNPs is difficult, especially in polyploid species. Peanut is an allopolyploid with a very low rate of true SNP calls. A large set of true and false SNPs identified from the Arachis 58k Affymetrix array was leveraged to train machine learning models that select true SNPs directly from sequence data. These models achieved accuracy rates above 80% on real peanut RNA-seq and whole genome shotgun (WGS) re-sequencing data, which is higher than previously reported for polyploids. A 48K SNP array, Axiom Arachis2, was designed using this approach and showed 75% accuracy in calling SNPs from different tetraploid peanut genotypes. Using the method to simulate SNP variation in peanut, cotton, wheat, and strawberry, we show that models built with our parameter sets achieve above 98% accuracy in selecting true SNPs. Additionally, models built with simulated genotypes selected true SNPs at above 80% accuracy on real peanut data, demonstrating that the approach can be used even when no real data are available for training. This work demonstrates an effective approach for calling highly reliable SNPs from polyploids using machine learning. A novel tool for predicting true SNPs from sequence data, designated SNP-ML (SNP-Machine Learning, pronounced “snip mill”), was developed using the described models. SNP-ML also provides functionality for training new models not included in this study for customized use, designated SNP-MLer (SNP-Machine Learner, pronounced “snip miller”). SNP-ML is freely available for public use.
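The general workflow described in the abstract (train on candidate SNPs labeled true or false, then measure accuracy on SNPs the model has not seen) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the abstract does not name the features or the learning algorithm, so the three features (alternate allele fraction, read depth, mapping quality), the toy values, and the nearest-centroid classifier are hypothetical stand-ins and not the SNP-ML implementation.

```python
from statistics import mean

# Hypothetical features per candidate SNP:
# (alt allele fraction, read depth, mean mapping quality).
# Label 1 = validated true SNP, 0 = false SNP. All values are made up.
train = [
    ((0.48, 35, 58), 1), ((0.52, 40, 60), 1), ((0.45, 28, 57), 1),
    ((0.12, 90, 31), 0), ((0.20, 75, 35), 0), ((0.15, 110, 28), 0),
]
held_out = [
    ((0.50, 33, 59), 1), ((0.18, 95, 30), 0),
    ((0.44, 30, 55), 1), ((0.22, 80, 33), 0),
]

def scale_params(rows):
    """Per-feature (min, max) taken from the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def scale(x, params):
    """Min-max scale one feature vector so no feature dominates the distance."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(x, params)]

def fit_centroids(data, params):
    """Mean scaled feature vector for each class label."""
    centroids = {}
    for label in {y for _, y in data}:
        rows = [scale(x, params) for x, y in data if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(x, centroids, params):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    z = scale(x, params)
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(z, centroids[lab])))

def accuracy(data, centroids, params):
    """Fraction of labeled SNPs classified correctly."""
    hits = sum(predict(x, centroids, params) == y for x, y in data)
    return hits / len(data)

params = scale_params([x for x, _ in train])
centroids = fit_centroids(train, params)
acc = accuracy(held_out, centroids, params)
print(f"held-out accuracy: {acc:.2f}")
```

The toy classes are well separated by construction, so the accuracy printed here says nothing about real data. The point is only the shape of the procedure: labels come from an independent validation source, and accuracy is computed on SNPs that were not used for training.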

Publisher

Cold Spring Harbor Laboratory

