Abstract
Language models are now routinely used for text classification and generative tasks. Recently, the same architectures have been applied to protein sequences, unlocking powerful tools for bioinformatics. Protein language models (pLMs) generate high-dimensional embeddings at the per-residue level, encoding the “semantic meaning” of each amino acid in the context of the full protein sequence. Multiple works use these representations as a starting point for downstream learning tasks and, more recently, for identifying distant homologous relationships between proteins. In this work, we introduce a new method that generates embedding-based protein sequence alignments (EBA), and we show that these alignments capture structural similarities even in the twilight zone, outperforming both classical sequence-based scores and other approaches based on protein language models. The method shows excellent accuracy even though it requires no training or parameter optimization. We expect that the combination of pLMs and alignment methods will soon rise in popularity, helping to detect relationships between proteins in the twilight zone.
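To make the idea of aligning two proteins from their per-residue embeddings concrete, here is a minimal sketch. It is not the EBA method itself, whose details the abstract does not give. It assumes cosine similarity between residue embeddings, a Smith-Waterman style local alignment over the resulting similarity matrix, and illustrative `gap` and `offset` values. The function names are hypothetical, and the random vectors merely stand in for real pLM output.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(emb_a, emb_b):
    """Residue-by-residue similarity from per-residue embeddings."""
    return [[cosine(u, v) for v in emb_b] for u in emb_a]


def local_align(sim, gap=0.5, offset=0.5):
    """Smith-Waterman local alignment over a similarity matrix.

    `offset` is subtracted from every similarity so that weakly similar
    residue pairs score negatively; `gap` is a linear gap penalty.
    Both values are assumptions chosen for this toy example.
    """
    n, m = len(sim), len(sim[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0.0,
                score[i - 1][j - 1] + sim[i - 1][j - 1] - offset,
                score[i - 1][j] - gap,
                score[i][j - 1] - gap,
            )
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Trace back from the best-scoring cell until the score drops to zero.
    i, j = best_cell
    pairs = []
    while i > 0 and j > 0 and score[i][j] > 0:
        diagonal = score[i - 1][j - 1] + sim[i - 1][j - 1] - offset
        if math.isclose(score[i][j], diagonal):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(score[i][j], score[i - 1][j] - gap):
            i -= 1
        else:
            j -= 1
    return best, pairs[::-1]


# Toy data: random vectors stand in for pLM embeddings (an assumption).
random.seed(0)
emb_a = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
emb_b = emb_a[1:5]  # second "protein" shares residues 1..4 of the first

score, pairs = local_align(similarity_matrix(emb_a, emb_b))
print(score, pairs)
```

In this toy setup the alignment should map residues 1 to 4 of the first sequence onto residues 0 to 3 of the second, since those embeddings are identical. With real pLM embeddings the similarity measure and the penalties would need to be chosen with care.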
Publisher
Cold Spring Harbor Laboratory
Cited by
6 articles.