Abstract
ABSTRACTThe effective use of metabarcoding in biodiversity science has brought important analytical challenges due to the need to generate accurate taxonomic assignments. The assignment of sequences to a generic or species level is critical for biodiversity surveys and biomonitoring, but it is particularly challenging. Researchers must select the approach that best recovers information on species composition. This study evaluates the performance and accuracy of seven methods in recovering the species composition of mock communities which vary in species number and specimen abundance, while holding upstream molecular and bioinformatic variables constant. It also evaluates the impact of parameter optimization on the quality of the predictions. Despite the general belief that BLAST top hit underperforms newer methods, our results indicate that it competes well with more complex approaches if optimized for the mock community under study. For example, the two machine learning methods that were benchmarked proved more sensitive to the reference database heterogeneity and completeness than methods based on sequence similarity. The accuracy of assignments was impacted by both species and specimen counts which will influence the selection of appropriate software. We urge the usage of realistic mock communities to allow optimization of parameters, regardless of the taxonomic assignment method used.
Publisher
Cold Spring Harbor Laboratory