Abstract
Protein language models (PLMs) convert amino acid sequences into the numerical representations required to train machine learning (ML) models. Many PLMs are large (>600M parameters) and trained on a broad span of protein sequence space. However, these models are limited in predictive accuracy and computationally costly. Here, we use multiplexed Ancestral Sequence Reconstruction (mASR) to generate small but focused functional protein sequence datasets for PLM training. Compared to large PLMs, this local ancestral sequence embedding (LASE) produces representations 10-fold faster and with higher predictive accuracy. We show that, owing to the evolutionary nature of the ASR data, LASE produces smoother fitness landscapes in which protein variants that are closer in fitness value are also numerically closer in representation space. This work contributes to the implementation of ML-based protein design in real-world settings, where data is sparse and computational resources are limited.
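The smoothness claim can be made concrete: if the landscape is smooth, pairwise distances between variant embeddings should correlate with pairwise differences in fitness. The sketch below, which is not the paper's code, illustrates one way to test this; the `embeddings` and `fitness` arrays are hypothetical placeholders standing in for LASE representations and measured fitness values.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# quantify landscape "smoothness" by checking whether variants with
# similar fitness also sit close together in representation space.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 128))  # 50 variants x 128-dim representations (placeholder)
fitness = rng.normal(size=50)            # one fitness value per variant (placeholder)

# Pairwise Euclidean distances in representation space...
emb_dist = pdist(embeddings, metric="euclidean")
# ...and pairwise absolute differences in fitness value.
fit_dist = pdist(fitness[:, None], metric="cityblock")

# On a smoother landscape, small embedding distances should coincide
# with small fitness differences, i.e. a positive rank correlation.
rho, p = spearmanr(emb_dist, fit_dist)
print(f"Spearman rho = {rho:.3f} (p = {p:.2g})")
```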
Publisher
Cold Spring Harbor Laboratory
Cited by
2 articles.