On the synergies between ribosomal assembly and machine learning tools for microbial identification

Published: 2022-10-03

Author:

Chau Stephanie, Rojas Carlos, Jetcheva Jorjeta G., Vijayakumar Sudha, Yuan Sophia, Stowbunenko Vincent, Shelton Amanda N., Andreopoulos William B.

Abstract

We compare several bioinformatics approaches for the assembly and identification of 16S rRNA sequences from metagenomes: phyloFlash; MEGAHIT and MetaSPAdes followed by Barrnap and BLAST or cmsearch; and Mothur, UniCycler, and PathRacer. We also evaluate two machine learning (ML) approaches: DNABERT and DeLUCS. We used two synthetic mock community datasets to evaluate the tools’ effectiveness on genomes that vary in four properties: repetitiveness, GC content, genome size, and coverage. The assembly-based tools and the ML approaches showed complementary performance in identifying organisms that vary in these four properties. phyloFlash gave the most true positives, identifying 22 of 26 organisms and missing 4, with 1 spurious hit. phyloFlash made mistakes in 16S reconstruction for species with higher repetitiveness, low or high genome coverage or GC content, as well as for communities containing similar species. In contrast, 16S reconstruction by whole-genome assembly followed by cmsearch, Barrnap and BLAST identified some of the organisms that phyloFlash missed but produced more spurious hits. Among the whole-genome assembly approaches, cmsearch following MetaSPAdes gave the best results for 16S identification, as it identified most of the species that phyloFlash missed. The ML tools, such as DeLUCS, identified most of the species in the mock community datasets that phyloFlash missed, but missed other species with low or high GC content, extreme repetitiveness, and small genomes. An rRNA-focused assembler like phyloFlash gave results faster and closer to the truth than assembling the metagenome and then finding the rRNAs in the assembly (with cmsearch or Barrnap), which gave many spurious hits. DeLUCS and phyloFlash showed complementary performance on our data, individually identifying most organisms with few spurious hits and together identifying almost all organisms. Our insights show that the tools have various strengths and weaknesses specific to the characteristics of the genomes involved.
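
To put the phyloFlash counts quoted above (22 organisms identified, 4 missed, 1 spurious hit) on a common scale, they can be turned into recall and precision. The sketch below is illustrative only: it assumes the spurious hit counts as a single false positive and that the 26 organisms form one pooled ground-truth set. The function name `detection_metrics` is hypothetical and is not taken from the paper.

```python
def detection_metrics(true_positives, false_negatives, false_positives):
    """Return (recall, precision) for organism-level identification calls."""
    # Recall: fraction of organisms in the mock community that were found.
    recall = true_positives / (true_positives + false_negatives)
    # Precision: fraction of reported organisms that are really present.
    precision = true_positives / (true_positives + false_positives)
    return recall, precision


# Counts as stated in the abstract for phyloFlash (assumed pooled over the
# mock communities): 22 identified, 4 missed, 1 spurious hit.
recall, precision = detection_metrics(22, 4, 1)
print(f"recall={recall:.3f} precision={precision:.3f}")
# prints: recall=0.846 precision=0.957
```

The same function could be applied to any other tool in the comparison once its identified, missed, and spurious counts are known; the abstract does not give those counts, so none are assumed here.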

Publisher

Cold Spring Harbor Laboratory
