Clustering biological sequences with dynamic sequence similarity threshold

Published: 2022-03-30
Volume: 23
Issue: 1
ISSN: 1471-2105
Container-title: BMC Bioinformatics
Language: en

Author:

Chiu Jimmy Ka Ho, Ong Rick Twee-Hee

Abstract

Background: Biological sequence clustering is a complicated data clustering problem owing to the high computational cost of calculating pairwise sequence distances through sequence alignments, as well as the difficulty of determining parameters that yield robust clusters. While current approaches succeed in reducing the number of sequence alignments performed, the generated clusters are based on a single sequence identity threshold applied to every cluster. A poor choice of this identity threshold would thus lead to low-quality clusters. There is, however, little support provided to users in selecting thresholds that are well matched to the input sequences.

Results: We present a novel sequence clustering approach called ALFATClust that exploits rapid pairwise alignment-free sequence distance calculations and graph community detection for cluster generation. Instead of applying a single threshold to every generated cluster, ALFATClust dynamically determines the cut-off threshold for each individual cluster by considering both cluster separation and intra-cluster sequence similarity. Benchmarking analysis shows that ALFATClust generally outperforms existing approaches by simultaneously maintaining cluster robustness and substantial cluster separation for the benchmark datasets. The software also provides an evaluation report for verifying the quality of the non-singleton clusters obtained.

Conclusions: ALFATClust is able to generate sequence clusters with high intra-cluster sequence similarity and substantial separation between clusters without requiring users to decide on precise similarity cut-off thresholds.
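The problem with a single global threshold can be illustrated with a small sketch. It is not ALFATClust's implementation: the abstract does not specify the distance measure or the community detection algorithm, so the code below assumes a k-mer Jaccard similarity as the alignment-free measure and uses plain connected components as a stand-in for community detection. All function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a, b, k=3):
    """Alignment-free similarity (assumed measure): shared k-mers / all k-mers."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def cluster(seqs, threshold, k=3):
    """Group sequences linked by similarity >= threshold.

    Connected components (union-find) stand in for graph community
    detection; one global threshold is applied to every cluster.
    """
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if jaccard_similarity(seqs[i], seqs[j], k) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def min_intra_similarity(seqs, members, k=3):
    """Lowest pairwise similarity inside one cluster (1.0 for singletons)."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0
    return min(jaccard_similarity(seqs[i], seqs[j], k) for i, j in pairs)


# Toy data: two pairs of related sequences with different similarity levels.
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
print(cluster(seqs, 0.5))
print(cluster(seqs, 0.75))
```

In this toy data the first pair shares 4 of 5 distinct 3-mers and the second pair 5 of 7, so a threshold of 0.75 keeps the first cluster but splits the second, while 0.5 keeps both. A per-cluster cut-off, as the abstract describes, is meant to avoid forcing that single choice on every cluster.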

Funder

Saw Swee Hock School of Public Health, National University of Singapore

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/s12859-022-04643-9.pdf

References (43 articles)

1. Murtagh F, Contreras P. Algorithms for hierarchical clustering: an overview. WIREs Data Min Knowl Discov. 2012;2(1):86–97.

2. National Center for Biotechnology Information (NCBI): Documentation of the BLASTCLUST-algorithm. ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html.

3. Enright AJ, Ouzounis CA. GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics. 2000;16(5):451–7.

4. Loewenstein Y, Portugaly E, Fromer M, Linial M. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics. 2008;24(13):i41–9.

5. Uchiyama I. Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes. Nucleic Acids Res. 2006;34(2):647–58.

Cited by 7 articles.

1. Applicability and perspectives for DNA barcoding of soil invertebrates; PeerJ; 2024-07-24

2. GradHC: highly reliable gradual hash-based clustering for DNA storage systems; Bioinformatics; 2024-04-22

3. Accurately clustering biological sequences in linear time by relatedness sorting; Nature Communications; 2024-04-08

4. AlignScape, displaying sequence similarity using self-organizing maps; Frontiers in Bioinformatics; 2024-01-26

5. EdtClust: A fast homologous protein sequences clustering method based on edit distance; 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2023-12-05