Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life

Published: 2020-09-21 Issue: 1 Volume: 21
ISSN: 1471-2105
Container-title: BMC Bioinformatics
Language: en
Short-container-title: BMC Bioinformatics

Author:

Zhao Zhengqiao, Cristian Alexandru, Rosen Gail

Abstract

Background: It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially growing NCBI RefSeq bacterial genome database. When new reference sequences are added to the training data, statically trained classifiers must be retrained on all data, which is highly inefficient. The rich literature on "incremental learning" addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier on all data.

Results: We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model's knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, runs in one quarter of the non-incremental time with no accuracy loss.

Conclusions: It is evident that classification improves when the classifier has the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable so that they can keep up with the data deluge. The incremental learning classifier can be updated efficiently without the cost of reprocessing or the need for access to the existing database, and therefore saves storage as well as computation resources.
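The incremental update described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a multinomial naive Bayes model over k-mer counts with add-one smoothing, and the names `kmer_counts` and `IncrementalKmerNB`, the value `k=3`, and the sequences are all hypothetical. It shows only the general idea: the model's state is a table of per-class counts, so new reference sequences can be folded in without revisiting earlier training data.

```python
from collections import Counter, defaultdict
import math


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a DNA string (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class IncrementalKmerNB:
    """Toy multinomial naive Bayes whose only state is per-class k-mer counts."""

    def __init__(self, k=3):
        self.k = k
        self.counts = defaultdict(Counter)  # class label -> k-mer counts
        self.totals = Counter()             # class label -> total k-mers seen

    def update(self, label, seq):
        """Fold one new reference sequence into the model.

        Only the stored counts are touched; previously seen sequences
        are not needed again.
        """
        new_counts = kmer_counts(seq, self.k)
        self.counts[label].update(new_counts)
        self.totals[label] += sum(new_counts.values())

    def classify(self, read):
        """Return the label with the highest add-one-smoothed log-likelihood."""
        vocab = 4 ** self.k  # number of possible DNA k-mers
        query = kmer_counts(read, self.k)

        def score(label):
            denom = self.totals[label] + vocab
            return sum(n * math.log((self.counts[label][kmer] + 1) / denom)
                       for kmer, n in query.items())

        return max(self.counts, key=score)


# Earlier "snapshot": two made-up reference sequences.
model = IncrementalKmerNB(k=3)
model.update("A", "ACGTACGTACGTACGT")
model.update("B", "GGGCCCGGGCCCGGGC")

# Later "snapshot": a new class is added without reprocessing A or B.
model.update("C", "TTTTAAAATTTTAAAA")

print(model.classify("ACGTACG"))
print(model.classify("TTTAAA"))
```

Because updating only adds to the stored counts, the cost of each update scales with the size of the new data rather than with the size of the whole database, which is the property the abstract emphasizes.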

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/s12859-020-03744-7.pdf

References: 46 articles.

1. Zynda GJ. Exponential growth of NCBI genomes. http://gregoryzynda.com/ncbi/genome/python/2014/03/31/ncbi-genome.html. Accessed 07 June 2019.

2. Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Mizrachi I, Ostell J, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Souvorov A, Starchenko G, Tatusova TA, Wagner L, Yaschenko E, Ye J. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008; 37(suppl_1):5–15. https://doi.org/10.1093/nar/gkn741.

3. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2008; 37(suppl_1):26–31. https://doi.org/10.1093/nar/gkn723.

4. Kyrpides NC, Hugenholtz P, Eisen JA, Woyke T, Göker M, Parker CT, Amann R, Beck BJ, Chain PSG, Chun J, Colwell RR, Danchin A, Dawyndt P, Dedeurwaerdere T, DeLong EF, Detter JC, De Vos P, Donohue TJ, Dong X-Z, Ehrlich DS, Fraser C, Gibbs R, Gilbert J, Gilna P, Glöckner FO, Jansson JK, Keasling JD, Knight R, Labeda D, Lapidus A, Lee J-S, Li W-J, Ma J, Markowitz V, Moore ERB, Morrison M, Meyer F, Nelson KE, Ohkuma M, Ouzounis CA, Pace N, Parkhill J, Qin N, Rossello-Mora R, Sikorski J, Smith D, Sogin M, Stevens R, Stingl U, Suzuki K-i, Taylor D, Tiedje JM, Tindall B, Wagner M, Weinstock G, Weissenbach J, White O, Wang J, Zhang L, Zhou Y-G, Field D, Whitman WB, Garrity GM, Klenk H-P. Genomic encyclopedia of bacteria and archaea: Sequencing a myriad of type strains. PLoS Biol. 2014; 12(8):1001920. https://doi.org/10.1371/journal.pbio.1001920.

5. Cullen CM, Aneja KK, Beyhan S, Cho CE, Woloszynek S, Convertino M, McCoy SJ, Zhang Y, Anderson MZ, Alvarez-Ponce D, Smirnova E, Karstens L, Dorrestein PC, Li H, Gupta AS, Cheung KKW, Powers JG, Zhao Z, Rosen GL. Emerging priorities for microbiome research. Front Microbiol. 2020; 11:136.

Cited by 15 articles.

1. Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA sequences;2024-07-23

2. The Naïve Bayes Classifier++ for Metagenomic Taxonomic Classification – Query Evaluation;2024-06-29

3. YACHT: an ANI-based statistical test to detect microbial presence/absence in a metagenomic sample;Bioinformatics;2024-01-24

4. Clinical Cytogenetics: Current Practices and Beyond;The Journal of Applied Laboratory Medicine;2024-01

5. ganon2: up-to-date and scalable metagenomics analysis;2023-12-08