learnMSA: learning and aligning large protein families

Published: 2022 Volume: 11
ISSN: 2047-217X
Container-title: GigaScience
Language: en

Author:

Becker Felix¹, Stanke Mario¹

Affiliation:

1. Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Straße 47, 17489 Greifswald, Germany

Abstract

Background: The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundamental problem since many downstream tasks rely on accurate alignments.

Results: We present learnMSA, a novel statistical learning approach of profile hidden Markov models (pHMMs) based on batch gradient descent. Fundamentally different from popular aligners, we fit a custom recurrent neural network architecture for (p)HMMs to potentially millions of sequences with respect to a maximum a posteriori objective and decode an alignment. We rely on automatic differentiation of the log-likelihood, and thus, our approach is different from existing HMM training algorithms like Baum–Welch. Our method does not involve progressive, regressive, or divide-and-conquer heuristics. We use uniform batch sampling to adapt to large datasets in linear time without the requirement of a tree. When tested on ultra-large protein families with up to 3.5 million sequences, learnMSA is both more accurate and faster than state-of-the-art tools. On the established benchmarks HomFam and BaliFam with smaller sequence sets, it matches state-of-the-art performance. All experiments were done on a standard workstation with a GPU.

Conclusions: Our results show that learnMSA does not share the counterintuitive drawback of many popular heuristic aligners, which can substantially lose accuracy when many additional homologs are input. LearnMSA is a future-proof framework for large alignments with many opportunities for further improvements.
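The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it names: the HMM forward recursion evaluated step by step like a recurrent cell, with the log-likelihood averaged over a uniformly sampled batch. It assumes a plain two-state discrete HMM rather than a profile HMM, and every name and parameter value here (`forward_log_likelihood`, `batch_mean_log_likelihood`, the toy probabilities) is hypothetical and not taken from learnMSA. In an automatic differentiation framework, the gradient of this batch quantity with respect to the parameters is what a gradient descent step would use; this sketch only computes the value.

```python
import math
import random


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(seq, log_init, log_trans, log_emit):
    """Return log P(seq) under a discrete HMM via the forward recursion.

    Each loop iteration maps the previous state vector `alpha` and one
    input symbol to a new state vector, like a recurrent cell.
    """
    n = len(log_init)
    alpha = [log_init[s] + log_emit[s][seq[0]] for s in range(n)]
    for symbol in seq[1:]:
        alpha = [
            logsumexp([alpha[r] + log_trans[r][s] for r in range(n)])
            + log_emit[s][symbol]
            for s in range(n)
        ]
    return logsumexp(alpha)


def batch_mean_log_likelihood(sequences, batch_size, params, rng):
    """Average log-likelihood over a uniformly sampled batch (no replacement)."""
    batch = rng.sample(sequences, batch_size)
    return sum(forward_log_likelihood(s, *params) for s in batch) / batch_size


# Toy parameters (assumed for illustration only): 2 states, 2 symbols.
LOG_INIT = [math.log(0.6), math.log(0.4)]
LOG_TRANS = [[math.log(0.7), math.log(0.3)],
             [math.log(0.2), math.log(0.8)]]
LOG_EMIT = [[math.log(0.9), math.log(0.1)],
            [math.log(0.3), math.log(0.7)]]
PARAMS = (LOG_INIT, LOG_TRANS, LOG_EMIT)

if __name__ == "__main__":
    data = [[0, 1, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
    rng = random.Random(0)
    print(batch_mean_log_likelihood(data, 2, PARAMS, rng))
```

Because each batch is a fixed-size uniform sample, the cost of one step does not depend on the total number of sequences, which is consistent with the linear-time scaling described in the abstract.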

Publisher

Oxford University Press (OUP)

Subject

Computer Science Applications,Health Informatics

Link

https://academic.oup.com/gigascience/article-pdf/doi/10.1093/gigascience/giac104/47119218/giac104.pdf

Reference: 49 articles.

1. Accelerated profile HMM searches;Eddy;PLoS Comp Biol,2011

2. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions;Mistry;Nucleic Acids Res,2013

3. Hidden Markov models in computational biology: applications to protein modeling;Krogh;J Mol Biol,1994

4. Multiple alignment using hidden Markov models;Eddy;Proc Int Conf Intell Syst Mol Biol,1995

5. Hidden Markov models in molecular biology: new algorithms and applications;Baldi;Adv Neural Info Process Syst,1992

Cited by 5 articles.

1. learnMSA2: deep protein multiple alignments with large language and hidden Markov models;Bioinformatics;2024-09-01

2. Tiberius: End-to-End Deep Learning with an HMM for Gene Prediction;2024-07-23

3. Towards the accurate alignment of over a million protein sequences: Current state of the art;Current Opinion in Structural Biology;2023-06

4. Phylogenetic analysis of promoter regions of human Dolichol kinase (DOLK) and orthologous genes using bioinformatics tools;Open Life Sciences;2023-01-01

5. learnMSA: learning and aligning large protein families;GigaScience;2022