Abstract
We compare several bioinformatics approaches for the assembly and identification of 16S rRNA sequences from metagenomes: phyloFlash; MEGAHIT and MetaSPAdes followed by Barrnap and BLAST or cmsearch; and Mothur, UniCycler, and PathRacer. We also evaluate two machine learning approaches: DNABERT and DeLUCS. We use two synthetic mock community datasets and evaluate the tools’ effectiveness on genomes that vary in four properties: repetitiveness, GC content, genome size, and coverage. The assembly-based tools and the machine learning approaches showed complementary performance in identifying organisms that vary in these four properties. phyloFlash gave the most true positives, identifying 22 of 26 organisms and missing 4, with 1 spurious hit. phyloFlash made mistakes in 16S reconstruction for species with higher repetitiveness, low or high genome coverage or GC content, and for communities containing similar species. In contrast, 16S reconstruction by whole-genome assembly followed by cmsearch, Barrnap, and BLAST identified some of the organisms that phyloFlash missed but produced more spurious hits. Among the whole-genome assembly approaches, cmsearch following MetaSPAdes gave the best results for 16S identification, as it identified most of the species that phyloFlash missed. The ML tools, such as DeLUCS, identified most of the species in the mock community datasets that phyloFlash missed, but missed other species with low or high GC content, extreme repetitiveness, or small genomes. An rRNA-focused assembler like phyloFlash gave results faster and closer to the truth than assembling the metagenome and then finding the rRNAs in the assembly (with cmsearch or Barrnap), which gave many spurious hits. DeLUCS and phyloFlash showed complementary performance on our data, individually identifying most organisms with few spurious hits and together identifying almost all organisms.
Our results show that the tools have distinct strengths and weaknesses that depend on the characteristics of the genomes involved.
Publisher
Cold Spring Harbor Laboratory