Abstract
Accurate datasets are essential for rigorous large-scale sequence-based analyses such as phylogenomics and pangenomics. With the increasing availability of sequence data and the varying quality of those sequences, reliable approaches are greatly needed to rapidly identify and automatically remove poor-quality and misidentified genomes from datasets before sequence-based analyses are performed. Here we present a robust, controlled, computationally efficient method to obtain species-level population structures of bacterial species regardless of the number of sequences in the analysis. Genus-level datasets can also be used with our pipeline to classify genomes into their species. This methodology can be leveraged to rapidly clean datasets covering entire bacterial species and to analyze the sub-species population structures present in the genomes provided. These cleaned datasets can be further reduced by a variety of methods to obtain sets of sequences, at various levels of diversity, that are representative of entire species.
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.