Removing bias in sequence models of protein fitness

Published: 2023-09-30

Author:

Shaw Ada, Spinner Hansen, Shin June, Gurev Sarah, Rollins Nathan, Marks Debora

Abstract

Unsupervised sequence models for protein fitness have emerged as powerful tools for protein design, used to engineer therapeutics and industrial enzymes, yet they are strongly biased towards potential designs that are close to their training data. This hinders their ability to generate functional sequences that are far from natural sequences, as is often desired when designing new functions. To address this problem, we introduce a de-biasing approach that enables the comparison of protein sequences across mutational depths to overcome the extant sequence similarity bias in natural sequence models. We demonstrate our method’s effectiveness at improving the relative natural sequence model predictions of experimentally measured variant functions across mutational depths. Using case studies of proteins with very low functional percentages further away from the wild type, we demonstrate that our method improves the recovery of top-performing variants in these sparsely functional regimes. Our method is generally applicable to any unsupervised fitness prediction model, for any function of any protein, and can thus easily be incorporated into any computational protein design pipeline. These studies have the potential to yield more efficient and cost-effective computational methods for designing diverse functional proteins and to inform underlying experimental library design to best take advantage of machine learning capabilities.

Publisher

Cold Spring Harbor Laboratory

Reference: 34 articles.

1. Sourcing thermotolerant poly(ethylene terephthalate) hydrolase scaffolds from natural diversity; Nat. Commun., 2022

2. Machine-directed evolution of an imine reductase for activity and stereoselectivity; ACS Catal., 2021

3. Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design

4. Deep diversification of an AAV capsid protein by machine learning; Nat. Biotechnol., 2021

5. Sinai, S., Jain, N., Church, G. M. & Kelsic, E. D. Generative AAV capsid diversification by latent interpolation (2021).

Cited by 5 articles.

1. Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation; 2024-07-13

2. Protein language models are biased by unequal sequence sampling across the tree of life; 2024-03-12

3. Addressing the antibody germline bias and its effect on language models for improved antibody design; 2024-02-07

4. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering; ACS Central Science; 2024-02-05

5. Continuous evolution of user-defined genes at 1-million-times the genomic mutation rate; 2023-11-14