Author:
James Benjamin T.,Girgis Hani Z.
Abstract
ABSTRACTGrouping sequences into similar clusters is an important part of sequence analysis. Widely used clustering tools sacrifice quality for speed. Previously, we developed MeShClust, which utilizes k-mer counts in an alignment-assisted classifier and the mean-shift algorithm for clustering DNA sequences. Although MeShClust outperformed related tools in terms of cluster quality, the alignment algorithm used for generating training data for the classifier was not scalable to longer sequences. In contrast, MeShClust2 generates semi-synthetic sequence pairs with known mutation rates, avoiding alignment algorithms. MeShClust2clustered 3600 bacterial genomes, providing a utility for clustering long sequences using identity scores for the first time.
Publisher
Cold Spring Harbor Laboratory
Cited by
7 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献