Author:
Balaban Metin,Moshiri Niema,Mai Uyen,Mirarab Siavash
Abstract
AbstractClustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given a (not necessarily ultrametric) tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints that limit the diameter of each cluster, the sum of its branch lengths, or chains of pairwise distances. These three versions of the problem can be solved in time that increases linearly with the size of the tree, a fact that has been known by computer scientists for two of these three criteria for decades. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU picking for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available athttps://github.com/niemasd/TreeCluster.
Publisher
Cold Spring Harbor Laboratory
Cited by
7 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献