Clustering evolving proteins into homologous families

Published:2013-04-08 Issue:1 Volume:14 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Chan Cheong Xin,Mahbob Maisarah,Ragan Mark A

Abstract

Background: Clustering sequences into groups of putative homologs (families) is a critical first step in many areas of comparative biology and bioinformatics. The performance of clustering approaches in delineating biologically meaningful families depends strongly on characteristics of the data, including content bias and degree of divergence. New, highly scalable methods have recently been introduced to cluster the very large datasets being generated by next-generation sequencing technologies. However, there has been little systematic investigation of how characteristics of the data impact the performance of these approaches.

Results: Using clusters from a manually curated dataset as reference, we examined the performance of a widely used graph-based Markov clustering algorithm (MCL) and a greedy heuristic approach (UCLUST) in delineating protein families coded by three sets of bacterial genomes of different G+C content. Both MCL and UCLUST generated clusters that are comparable to the reference sets at specific parameter settings, although UCLUST tends to under-cluster compositionally biased sequences (G+C content 33% and 66%). Using simulated data, we sought to assess the individual effects of sequence divergence, rate heterogeneity, and underlying G+C content. Performance decreased with increasing sequence divergence, decreasing among-site rate variation, and increasing G+C bias. Two MCL-based methods recovered the simulated families more accurately than did UCLUST. MCL using local alignment distances is more robust across the investigated range of sequence features than are greedy heuristics using distances based on global alignment.

Conclusions: Our results demonstrate that sequence divergence, rate heterogeneity and content bias can individually and in combination affect the accuracy with which MCL and UCLUST can recover homologous protein families. For application to data that are more divergent, and exhibit higher among-site rate variation and/or content bias, MCL may often be the better choice, especially if computational resources are not limiting.
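
To illustrate what comparing predicted clusters against a reference set can look like, here is a minimal Python sketch. It is an assumption for illustration only: the abstract does not state which agreement measure the authors used, so the pairwise precision and recall below, the function names, and the toy sequence identifiers are all hypothetical. The sketch shows how under-clustering (splitting one family into several clusters) lowers recall while leaving precision intact, and how over-clustering does the reverse.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered sequence pairs that share a cluster."""
    pairs = set()
    for members in clusters:
        # Sort so that each pair has one canonical orientation.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted, reference):
    """Compare two clusterings by the sequence pairs they group together.

    Precision: fraction of predicted pairs that are also reference pairs.
    Recall: fraction of reference pairs recovered by the prediction.
    (Illustrative metric; not necessarily the one used in the paper.)
    """
    pred = co_clustered_pairs(predicted)
    ref = co_clustered_pairs(reference)
    shared = len(pred & ref)
    precision = shared / len(pred) if pred else 1.0
    recall = shared / len(ref) if ref else 1.0
    return precision, recall


# Hypothetical toy data: two reference families, A and B.
reference = [{"a1", "a2", "a3"}, {"b1", "b2"}]

# Under-clustering: family A is split into two clusters.
under_clustered = [{"a1", "a2"}, {"a3"}, {"b1", "b2"}]

# Over-clustering: both families are merged into one cluster.
over_clustered = [{"a1", "a2", "a3", "b1", "b2"}]

print(pairwise_precision_recall(under_clustered, reference))
print(pairwise_precision_recall(over_clustered, reference))
```

In this toy case the under-clustered result keeps only correct pairs but misses half of the reference pairs, whereas the merged result recovers every reference pair at the cost of many spurious ones.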

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/1471-2105-14-120.pdf

References: 35 articles.

1. Homology. The Hierarchical Basis of Comparative Biology. Edited by: Hall BK. 1994, San Diego: Academic Press

2. Cheng L, Walker AW, Corander J: Bayesian estimation of bacterial community composition from 454 sequencing data. Nucleic Acids Res. 2012, 40: 5240-5249. 10.1093/nar/gks227.

3. Sun Y, Cai Y, Huse SM, Knight R, Farmerie WG, Wang X, Mai V: A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis. Brief Bioinform. 2012, 13: 107-121. 10.1093/bib/bbr009.

4. Cai Y, Sun Y: ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time. Nucleic Acids Res. 2011, 39: e95. 10.1093/nar/gkr349.

5. Li W, Godzik A: CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22: 1658-1659. 10.1093/bioinformatics/btl158.

Cited by 9 articles.

1. VirClust—A Tool for Hierarchical Clustering, Core Protein Detection and Annotation of (Prokaryotic) Viruses;Viruses;2023-04-19

2. VirClust – a tool for hierarchical clustering, core gene detection and annotation of (prokaryotic) viruses;2021-06-14

3. Massive expansion of human gut bacteriophage diversity;Cell;2021-02

4. Massive expansion of human gut bacteriophage diversity;2020-09-03

5. 3gClust: Human Protein Cluster Analysis;IEEE/ACM Transactions on Computational Biology and Bioinformatics;2019