Abstract
The cost of sequencing a genome is dropping much faster than the cost of assembling and finishing it. Lightly sampled genomes (genome skims) could be transformative for genomic ecology, and k-mer-based results have shown the advantage of this approach for the identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on the k-mer repeat spectrum. Using a mix of theoretical and empirical analysis, we show that estimating k-mer spectra faces fundamental limitations due to ill-conditioned systems, and that these limitations have implications for other genomic parameters. We get around this problem with a novel constrained optimization approach (Spline Linear Programming), in which the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had < 1.5% error in length estimation, compared to the 34% error previously achieved. On shotgun-sequenced read samples with contaminants, RESPECT length estimates had a median error of 4%, in contrast to other methods, which had a median error of 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git
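To give a feel for why low coverage can make spectrum estimation ill-conditioned, here is a minimal sketch. It is not the RESPECT method and does not implement Spline Linear Programming. It assumes a simplified model that the abstract does not state: a k-mer occurring j times in the genome is observed a Poisson-distributed number of times with mean j times the k-mer coverage, and sequencing errors are ignored. The function names (`poisson_pmf`, `mixing_matrix`, `adjacent_column_cosine`) and all parameter values are hypothetical. Under this assumption, the expected k-mer count histogram is a mixing matrix applied to the repeat spectrum, and nearly parallel columns in that matrix mean that different spectra yield almost the same histogram.

```python
import math


def poisson_pmf(count, mean):
    """Poisson probability of observing `count` when the expected value is `mean`."""
    return math.exp(-mean + count * math.log(mean) - math.lgamma(count + 1))


def mixing_matrix(coverage, max_multiplicity, max_count):
    """Entry [i][j] is the probability (under the ASSUMED Poisson model) that a
    k-mer with genomic multiplicity j + 1 is observed i + 1 times in the reads."""
    return [
        [poisson_pmf(count, mult * coverage) for mult in range(1, max_multiplicity + 1)]
        for count in range(1, max_count + 1)
    ]


def adjacent_column_cosine(matrix):
    """Largest cosine similarity between neighbouring columns.

    Values close to 1.0 mean two multiplicities produce nearly the same
    observed-count profile, so the linear system is close to singular."""
    columns = list(zip(*matrix))
    worst = 0.0
    for left, right in zip(columns, columns[1:]):
        dot = sum(a * b for a, b in zip(left, right))
        norm = math.sqrt(sum(a * a for a in left)) * math.sqrt(sum(b * b for b in right))
        worst = max(worst, dot / norm)
    return worst


# Hypothetical settings: multiplicities 1..5, observed counts 1..400.
low_coverage = adjacent_column_cosine(mixing_matrix(1.0, 5, 400))
high_coverage = adjacent_column_cosine(mixing_matrix(50.0, 5, 400))

print(f"max adjacent-column cosine at k-mer coverage 1:  {low_coverage:.3f}")
print(f"max adjacent-column cosine at k-mer coverage 50: {high_coverage:.3f}")
```

In this toy model the columns are far closer to collinear at low coverage than at high coverage, which is one way to picture why an unconstrained inversion becomes unreliable for genome skims and why additional constraints may help.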
Publisher
Cold Spring Harbor Laboratory
Cited by
2 articles.