Sequence embedding for fast construction of guide trees for multiple sequence alignment

Published:2010-05-14 Issue:1 Volume:5 Page:
ISSN:1748-7188
Container-title:Algorithms for Molecular Biology
language:en
Short-container-title:Algorithms Mol Biol

Author:

Blackshields Gordon,Sievers Fabian,Shi Weifeng,Wilm Andreas,Higgins Desmond G

Abstract

Background: The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N² for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.

Results: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.

Conclusions: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
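The general idea of embedding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the k-mer Jaccard distance, the random choice of reference sequences, and the names `kmer_distance`, `embed` and `euclidean` are all assumptions made for illustration. Each sequence is represented by its distances to a small set of reference sequences, so only N × t distances are computed (for t references) rather than all N(N-1)/2 pairs.

```python
import math
import random


def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Toy distance: 1 - Jaccard similarity of k-mer sets.

    This measure is an assumption for illustration only; the abstract
    does not specify which pairwise distance is used.
    """
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def embed(sequences, num_refs, seed=0):
    """Embed each sequence as a vector of distances to reference sequences.

    References are picked at random here (a hypothetical choice).
    Returns the vectors and the number of distance evaluations performed.
    """
    rng = random.Random(seed)
    refs = rng.sample(sequences, num_refs)
    vectors = [[kmer_distance(s, r) for r in refs] for s in sequences]
    return vectors, len(sequences) * num_refs


def euclidean(u, v):
    """Euclidean distance between two embedded vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


if __name__ == "__main__":
    seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
    vectors, n_calls = embed(seqs, num_refs=2)
    print("distance evaluations:", n_calls)
    print("embedded distance 0-1:", euclidean(vectors[0], vectors[1]))
    print("embedded distance 0-2:", euclidean(vectors[0], vectors[2]))
```

The embedded vectors can then be handed to any standard clustering routine, which works on cheap vector comparisons instead of the expensive sequence-to-sequence distance.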

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computational Theory and Mathematics,Molecular Biology,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/1748-7188-5-21.pdf

References (33 articles)

1. Hogeweg P, Hesper B: The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J Mol Evol. 1984, 20 (2): 175-86. 10.1007/BF02257378

2. Taylor WR: Multiple sequence alignment by a pairwise algorithm. Comput Appl Biosci. 1987, 3 (2): 81-7.

3. Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 1987, 25 (4): 351-60. 10.1007/BF02603120

4. Notredame C, Higgins DG, Heringa J: T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000, 302: 205-217. 10.1006/jmbi.2000.4042

5. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22): 4673-80. 10.1093/nar/22.22.4673

Cited by 88 articles.

1. Accurately clustering biological sequences in linear time by relatedness sorting;Nature Communications;2024-04-08

2. Phylogenomic curation of Ovate Family Proteins (OFPs) in the U’s Triangle of Brassica L. indicates stress-induced growth modulation;PLOS ONE;2024-01-26

3. Towards the accurate alignment of over a million protein sequences: Current state of the art;Current Opinion in Structural Biology;2023-06

4. WMSA 2: a multiple DNA/RNA sequence alignment tool implemented with accurate progressive mode and a fast win-win mode combining the center star and progressive strategies;Briefings in Bioinformatics;2023-05-17

5. UPP2: fast and accurate alignment of datasets with fragmentary sequences;Bioinformatics;2023-01-01