Abstract
Meta-omics has become commonplace in the study of microbial eukaryotes. The explosion of available data has inspired large-scale analyses, including species or taxonomic group distribution mapping, gene catalog construction, and inference on the functional roles and activities of microbial eukaryotes in situ. However, genome and transcriptome databases are prone to misannotation biases, and meta-omic inventories may have no recoverable taxonomic annotation for more than half of assembled contigs or predicted proteins. Direct mapping solely to organisms of interest might introduce a problematic misattribution bias, while full databases can annotate any cataloged organism but may be imbalanced between taxa. Here, we explore the potential pitfalls of common approaches to taxonomic annotation of protistan meta-omic datasets. We argue that ongoing curation of genetic resources is critical in accurately annotating protists in situ in meta-omic datasets. Moreover, we propose that precise taxonomic annotation of meta-omic data is a clustering problem rather than a feasible alignment problem. We show that taxonomic membership of sequence clusters demonstrates more accurate estimated community composition than returning exact sequence labels, and overlap between clusters can address database shortcomings. Clustering approaches can be applied to diverse environments while continuing to exploit the wealth of annotation data collated in databases, and database selection and evaluation is a critical part of correctly annotating protistan taxonomy in environmental datasets. We re-analyze three environmental datasets at three levels of taxonomic hierarchy in order to illustrate the critical importance of both database completeness and curation in enabling accurate environmental interpretation.
Publisher
Cold Spring Harbor Laboratory
Cited by
4 articles.