Deep embedding and alignment of protein sequences

Published: 2021-11-15

Author:

Llinares-López Felipe, Berthet Quentin, Blondel Mathieu, Teboul Olivier, Vert Jean-Philippe

Abstract

Protein sequence alignment is a key component of most bioinformatics pipelines used to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here, we leverage recent advances in deep learning for language modelling and differentiable programming to propose DEDAL, a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. Once trained, we show that DEDAL improves alignment correctness on remote homologs by up to two- or three-fold over existing methods, and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.

Publisher

Cold Spring Harbor Laboratory

References (67 articles)

1. Prakash, T. & Taylor, T. D. Functional assignment of metagenomic data: challenges and applications. Brief Bioinform. 13 (2012).

2. Evolutionarily Conserved Pathways of Energetic Connectivity in Protein Families

3. Protein 3D Structure Computed from Evolutionary Sequence Variation

4. Highly accurate protein structure prediction with AlphaFold. Nature (2021).

5. Identification of common molecular subsequences

Cited by 5 articles.

1. Protein language model powers accurate and fast sequence search for remote homology (2023-04-05)

2. pLM-BLAST – distant homology detection based on direct comparison of sequence representations from protein language models (2022-11-25)

3. TM-Vec: template modeling vectors for fast homology detection and alignment (2022-07-27)

4. Harnessing machine translation methods for sequence alignment (2022-07-23)

5. learnMSA: learning and aligning large protein families. GigaScience (2022)