MeShClust2: Application of alignment-free identity scores in clustering long DNA sequences

Published: 2018-10-24

Authors:

James Benjamin T., Girgis Hani Z.

Abstract

Grouping sequences into similar clusters is an important part of sequence analysis. Widely used clustering tools sacrifice quality for speed. Previously, we developed MeShClust, which utilizes k-mer counts in an alignment-assisted classifier and the mean-shift algorithm for clustering DNA sequences. Although MeShClust outperformed related tools in terms of cluster quality, the alignment algorithm used for generating training data for the classifier was not scalable to longer sequences. In contrast, MeShClust2 generates semi-synthetic sequence pairs with known mutation rates, avoiding alignment algorithms. MeShClust2 clustered 3600 bacterial genomes, providing a utility for clustering long sequences using identity scores for the first time.
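The idea of a semi-synthetic pair with a known mutation rate can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes substitution-only mutations at distinct positions, and the names `mutate` and `kmer_counts` are hypothetical helpers introduced here for illustration. Because the number of substituted positions is chosen up front, the identity of the pair is known without running an alignment, and the k-mer counts of both sequences are available as features.

```python
import random
from collections import Counter

def mutate(seq, rate, rng):
    """Return a copy of seq with round(rate * len(seq)) distinct positions substituted.

    Assumption: substitutions only (no insertions or deletions), so the
    identity of the resulting pair is exactly 1 - n_mut / len(seq).
    """
    n_mut = round(rate * len(seq))
    positions = rng.sample(range(len(seq)), n_mut)
    out = list(seq)
    for p in positions:
        # Pick a base different from the current one so every hit is a real change.
        out[p] = rng.choice([b for b in "ACGT" if b != out[p]])
    return "".join(out)

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

rng = random.Random(0)
template = "".join(rng.choice("ACGT") for _ in range(1000))
mutated = mutate(template, 0.10, rng)

# Known identity of the pair, obtained without any alignment.
identity = sum(a == b for a, b in zip(template, mutated)) / len(template)

# k-mer count vectors that a classifier could use as input features.
template_kmers = kmer_counts(template, 3)
mutated_kmers = kmer_counts(mutated, 3)
```

Pairs produced this way carry their own label (the mutation rate), which is what makes them usable as training data without an alignment step.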

Publisher

Cold Spring Harbor Laboratory

References: 42 articles.

1. Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis; Curr. opinion biotechnology, 2012

2. The case for cloud computing in genome informatics

3. SEED: efficient clustering of next-generation sequences

4. Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads

5. CD-HIT: accelerated for clustering the next-generation sequencing data

Cited by 7 articles.

1. An 8000 years old genome reveals the Neolithic origin of the zoonosis Brucella melitensis;Nature Communications;2024-07-20

2. Clustering biological sequences with dynamic sequence similarity threshold;BMC Bioinformatics;2022-03-30

3. MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores;2022-01-17

4. Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models;NAR Genomics and Bioinformatics;2021-02-01

5. Approximate Hashing for Bioinformatics;Implementation and Application of Automata;2021