Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone

Published: 2024-01-01 Issue: 1 Volume: 40
ISSN: 1367-4803
Container-title: Bioinformatics
Language: en

Author:

Pantolini Lorenzo¹,², Studer Gabriel¹,², Pereira Joana¹,², Durairaj Janani¹,², Tauriello Gerardo¹,², Schwede Torsten¹,²

Affiliation:

1. Biozentrum, University of Basel, Basel 4056, Switzerland

2. SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland

Abstract

Motivation

Language models are routinely used for text classification and generative tasks. Recently, the same architectures were applied to protein sequences, unlocking powerful new approaches in the bioinformatics field. Protein language models (pLMs) generate high-dimensional embeddings on a per-residue level and encode a “semantic meaning” of each individual amino acid in the context of the full protein sequence. These representations have been used as a starting point for downstream learning tasks and, more recently, for identifying distant homologous relationships between proteins.

Results

In this work, we introduce a new method that generates embedding-based protein sequence alignments (EBA) and show how these capture structural similarities even in the twilight zone, outperforming both classical methods as well as other approaches based on pLMs. The method shows excellent accuracy despite the absence of training and parameter optimization. We demonstrate that the combination of pLMs with alignment methods is a valuable approach for the detection of relationships between proteins in the twilight-zone.

Availability and implementation

The code to run EBA and reproduce the analysis described in this article is available at: https://git.scicore.unibas.ch/schwede/EBA and https://git.scicore.unibas.ch/schwede/eba_benchmark.
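The general idea named in the abstract, scoring residue pairs by the similarity of their per-residue embeddings and then running a dynamic programming alignment over that score matrix, can be illustrated with a minimal sketch. This is not the EBA implementation from the repositories above. The cosine similarity, the linear gap penalty, the global (Needleman-Wunsch-style) recursion, and the names `cosine`, `align_embeddings` and `gap` are all assumptions chosen for illustration, and the toy 2-dimensional vectors stand in for real pLM embeddings.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]
AlignedPair = Tuple[Optional[int], Optional[int]]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align_embeddings(
    emb_a: List[Vector], emb_b: List[Vector], gap: float = -0.5
) -> Tuple[float, List[AlignedPair]]:
    """Global alignment of two sequences of per-residue embeddings.

    Hypothetical helper: substitution scores are cosine similarities and
    gaps use a linear penalty. Returns the alignment score and a list of
    (index_in_a, index_in_b) pairs, with None marking a gap.
    """
    n, m = len(emb_a), len(emb_b)
    sim = [[cosine(a, b) for b in emb_b] for a in emb_a]

    # Fill the dynamic programming matrix.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # align i with j
                score[i - 1][j] + gap,                    # gap in b
                score[i][j - 1] + gap,                    # gap in a
            )

    # Trace back from the bottom-right corner to recover the alignment.
    pairs: List[AlignedPair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Toy embeddings: three "residues" in a, two in b.
toy_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_b = [[1.0, 0.0], [1.0, 1.0]]
total, alignment = align_embeddings(toy_a, toy_b)
print(total, alignment)
```

In this toy case the middle residue of `toy_a` has no good partner in `toy_b`, so the traceback leaves it opposite a gap. A real pipeline would replace the toy vectors with embeddings produced by a pLM and would likely need a more careful scoring scheme than raw cosine similarity.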

Funder

SIB Swiss Institute of Bioinformatics

Biozentrum, University of Basel

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad786/55024500/btad786.pdf
