Accurately clustering biological sequences in linear time by relatedness sorting

Published:2024-04-08 Issue:1 Volume:15 Page:
ISSN:2041-1723
Container-title:Nature Communications
Language:en
Short-container-title:Nat Commun

Author:

Wright Erik

Abstract

Clustering biological sequences into similar groups is an increasingly important task as the number of available sequences continues to grow exponentially. Search-based approaches to clustering scale super-linearly with the number of input sequences, making it impractical to cluster very large sets of sequences. Approaches to clustering sequences in linear time currently lack the accuracy of super-linear approaches. Here, I set out to develop and characterize a strategy for clustering with linear time complexity that retains the accuracy of less scalable approaches. The resulting algorithm, named Clusterize, sorts sequences by relatedness to linearize the clustering problem. Clusterize produces clusters with accuracy rivaling popular programs (CD-HIT, MMseqs2, and UCLUST) but exhibits linear asymptotic scalability. Clusterize generates higher accuracy and oftentimes much larger clusters than Linclust, a fast linear time clustering algorithm. I demonstrate the utility of Clusterize by accurately solving different clustering problems involving millions of nucleotide or protein sequences.
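The general idea of sorting sequences so that likely cluster mates end up adjacent can be illustrated with a toy sketch. This is not the Clusterize algorithm or its implementation: the k-mer similarity measure, the use of the first sequence as a sorting anchor, and the names `kmer_profile`, `similarity`, and `cluster_by_sorting` are all assumptions made for illustration only. After sorting, each sequence is compared against at most `window` recent cluster representatives, so the number of comparisons in the scan is bounded by the number of sequences times `window`.

```python
from collections import Counter


def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def similarity(a: Counter, b: Counter) -> float:
    """Shared k-mers (multiset overlap) divided by the size of the smaller profile."""
    shared = sum((a & b).values())
    smaller = min(sum(a.values()), sum(b.values()))
    return shared / smaller if smaller else 0.0


def cluster_by_sorting(seqs, threshold=0.7, window=2, k=3):
    """Toy sort-then-scan clustering; returns one cluster id per input sequence."""
    if not seqs:
        return []
    profiles = [kmer_profile(s, k) for s in seqs]
    anchor = profiles[0]  # assumption: the first sequence serves as the sorting anchor
    # Order indices by decreasing similarity to the anchor so related sequences are adjacent.
    order = sorted(range(len(seqs)), key=lambda i: -similarity(profiles[i], anchor))
    labels = [None] * len(seqs)
    recent_reps = []  # (cluster id, profile) for the most recent representatives
    next_id = 0
    for i in order:
        # Compare only against a bounded number of recent representatives.
        for cid, rep in reversed(recent_reps):
            if similarity(profiles[i], rep) >= threshold:
                labels[i] = cid
                break
        else:
            # No close representative found: start a new cluster.
            labels[i] = next_id
            recent_reps.append((next_id, profiles[i]))
            recent_reps = recent_reps[-window:]
            next_id += 1
    return labels


if __name__ == "__main__":
    example = ["ACGTACGTAC", "TTTTGGGGCC", "ACGTACGTAA", "TTTTGGGGCA"]
    print(cluster_by_sorting(example))
```

In this sketch the sort itself costs O(N log N), so only the comparison scan is linear in the number of sequences; a single anchor is also a crude notion of relatedness and would separate true cluster mates on realistic data.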

Funder

Division of Intramural Research, National Institute of Allergy and Infectious Diseases

Publisher

Springer Science and Business Media LLC

Link

https://www.nature.com/articles/s41467-024-47371-9.pdf

References: 49 articles.

1. Li, W., Fu, L., Niu, B., Wu, S. & Wooley, J. Ultrafast clustering algorithms for metagenomic sequence analysis. Brief. Bioinform. 13, 656–668 (2012).

2. Zou, Q., Lin, G., Jiang, X., Liu, X. & Zeng, X. Sequence clustering in bioinformatics: an empirical study. Brief. Bioinform. 21, 1–10 (2018).

3. Cai, Y. & Sun, Y. ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time. Nucleic Acids Res. 39, e95 (2011).

4. Blackshields, G., Sievers, F., Shi, W., Wilm, A. & Higgins, D. G. Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol. Biol. 5, 21 (2010).

5. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).

Cited by 2 articles.

1. Enzymatic carbon-fluorine bond cleavage by human gut microbes;2024-07-15

2. Many purported pseudogenes in bacterial genomes are bona fide genes;BMC Genomics;2024-04-15