ProstT5: Bilingual Language Model for Protein Sequence and Structure

Published: 2023-07-25

Author:

Heinzinger Michael, Weissenow Konstantin, Sanchez Joaquin Gomez, Henkel Adrian, Steinegger Martin, Rost Burkhard

Abstract

Advanced Artificial Intelligence (AI) enabled large language models (LLMs) to revolutionize Natural Language Processing (NLP). Their adaptation to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. For the first time, we can now systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve in linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities by combining 1D sequences with 3D structure in one generic model. For this, we encode protein structures as token sequences using the 3Di-alphabet introduced by Foldseek. The resulting “structure-sequence” representation is processed by a pLM to extract features and patterns. Toward this end, we constructed a non-redundant dataset from AlphaFoldDB and fine-tuned an existing pLM (ProtT5) to translate between 3Di and amino acid sequences. As a proof-of-concept for our novel approach, dubbed Protein structure-sequence T5 (ProstT5), we showed improved performance for subsequent prediction tasks, and for “inverse folding”, namely the generation of novel protein sequences adopting a given structural scaffold (“fold”). Our work showcased the potential of pLMs to tap into the information-rich protein structure revolution fueled by AlphaFold2. It paves the way for the development of tools optimizing the integration of this vast 3D structure data resource, opening new research avenues in the post AlphaFold2 era. We released our model at https://github.com/mheinzinger/ProstT5.
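The abstract describes translating between amino acid sequences and 3Di structure strings with a single model. A minimal sketch of how input text for such a bilingual model might be prepared is given below. It is not taken from the abstract: the direction tags `<AA2fold>` and `<fold2AA>`, the upper-case/lower-case convention for the two alphabets, the replacement of rare residues with `X`, and the helper name `format_for_translation` are all assumptions for illustration, and should be checked against the released repository before use.

```python
import re

# Assumed direction tags (not stated in the abstract; verify against the repository).
AA_TO_3DI_PREFIX = "<AA2fold>"   # amino acid sequence -> 3Di structure string
DI3_TO_AA_PREFIX = "<fold2AA>"   # 3Di structure string -> amino acid sequence


def format_for_translation(sequence: str, source: str) -> str:
    """Hypothetical helper: build a space-separated, tagged model input.

    source="aa"  : treat `sequence` as amino acids (assumed upper-case tokens).
    source="3di" : treat `sequence` as 3Di states (assumed lower-case tokens).
    """
    seq = sequence.strip()
    if not seq.isalpha():
        raise ValueError("sequence must contain letters only")

    if source == "aa":
        # Assumption: rare/ambiguous residues are mapped to the unknown token X.
        seq = re.sub(r"[UZOB]", "X", seq.upper())
        prefix = AA_TO_3DI_PREFIX
    elif source == "3di":
        seq = seq.lower()
        prefix = DI3_TO_AA_PREFIX
    else:
        raise ValueError("source must be 'aa' or '3di'")

    # One token per residue/state, separated by single spaces.
    return prefix + " " + " ".join(seq)


if __name__ == "__main__":
    print(format_for_translation("MKU", "aa"))
    print(format_for_translation("DVQ", "3di"))
```

Keeping the two alphabets in different letter cases lets one vocabulary hold both modalities without collisions, which is one plausible way to realize the "bilingual" setup the abstract describes.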

Publisher

Cold Spring Harbor Laboratory

References: 86 articles.

1. Fast and accurate protein structure search with Foldseek

2. A. Vaswani et al., “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.

3. Modeling aspects of the language of life through transfer-learning protein sequences

4. ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing

5. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

Cited by 39 articles.

1. A bibliometric review on application of machine learning in additive manufacturing and practical justification;Applied Materials Today;2024-10

2. Improving viral annotation with artificial intelligence;mBio;2024-09-04

3. Fine-tuning protein language models boosts predictions across diverse tasks;Nature Communications;2024-08-28

4. PHIStruct: Improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings;2024-08-24

5. Clustering protein functional families at large scale with hierarchical approaches;Protein Science;2024-08-15