Abstract
The MinHash algorithm has proven effective for rapidly estimating the resemblance of two genomes or metagenomes. However, this method cannot reliably estimate the containment of a genome within a metagenome. Here, we describe an online algorithm capable of measuring the containment of genomes and proteomes within either assembled or unassembled sequencing read sets. We describe several use cases, including contamination screening and retrospective analysis of metagenomes for novel genome discovery. Using this tool, we provide containment estimates for every NCBI RefSeq genome within every SRA metagenome and demonstrate the identification of a novel polyomavirus species from a public metagenome.
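The difference between resemblance and containment can be made concrete with a small sketch. The code below is an illustration only, not the authors' implementation: the function names, the k-mer size, the sketch size, the read length, and the choice of BLAKE2b as the hash are all assumptions made for the example. It compares the Jaccard index (resemblance) with the containment of a genome's k-mers in a much larger mixture, then estimates containment in a streaming fashion by checking which hashes of a fixed-size genome sketch are observed while reading through the mixture once.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an arbitrary choice here)."""
    return int.from_bytes(
        hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big"
    )


def bottom_sketch(seq, k, s):
    """Bottom-s MinHash sketch: the s smallest k-mer hashes of a sequence."""
    return set(sorted({hash_kmer(x) for x in kmers(seq, k)})[:s])


def jaccard(a, b):
    """Resemblance: shared elements relative to the union of both sets."""
    return len(a & b) / len(a | b)


def containment(a, b):
    """Containment of a in b: shared elements relative to a alone."""
    return len(a & b) / len(a)


def streamed_containment(genome_sketch, reads, k):
    """Estimate containment in one pass over the reads.

    Only the genome sketch is held in memory; each read k-mer is hashed
    and recorded if it matches a hash in the sketch.
    """
    seen = set()
    for read in reads:
        for x in kmers(read, k):
            hv = hash_kmer(x)
            if hv in genome_sketch:
                seen.add(hv)
    return len(seen) / len(genome_sketch)


def chop(seq, length, step):
    """Cut a sequence into overlapping pseudo-reads (hypothetical helper)."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]


# Toy data: a short "genome" fully present in a tenfold larger "metagenome".
random.seed(0)
K = 11
genome = "".join(random.choices("ACGT", k=200))
other = "".join(random.choices("ACGT", k=2000))
reads = chop(genome, 50, 10) + chop(other, 50, 10)

genome_kmers = kmers(genome, K)
mixture_kmers = set().union(*(kmers(r, K) for r in reads))

exact_jaccard = jaccard(genome_kmers, mixture_kmers)
exact_containment = containment(genome_kmers, mixture_kmers)
estimated = streamed_containment(bottom_sketch(genome, K, 64), reads, K)

print("Jaccard:", exact_jaccard)
print("Containment:", exact_containment)
print("Streamed estimate:", estimated)
```

In this toy case the genome is fully present in the mixture, so its containment is 1, while the Jaccard index stays low because the union is dominated by unrelated k-mers. This is the kind of size imbalance that makes resemblance a poor measure of whether a genome is present in a metagenome.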
Publisher
Springer Science and Business Media LLC
Cited by
180 articles.