Machine learning to predict continuous protein properties from binary cell sorting data and map unseen sequence space

Published:2024-03-07 Issue:11 Volume:121
ISSN:0027-8424
Container-title:Proceedings of the National Academy of Sciences
language:en
Short-container-title:Proc. Natl. Acad. Sci. U.S.A.

Author:

Case Marshall¹, Smith Matthew¹², Vinh Jordan³, Thurber Greg¹³

Affiliation:

1. Chemical Engineering, University of Michigan, Ann Arbor, MI 48109

2. Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109

3. Biomedical Engineering, University of Michigan, Ann Arbor, MI 48109

Abstract

Proteins are a diverse class of biomolecules responsible for wide-ranging cellular functions, from catalyzing reactions to recognizing pathogens. The ability to evolve proteins rapidly and inexpensively toward improved properties is a common objective for protein engineers. Powerful high-throughput methods like fluorescence-activated cell sorting and next-generation sequencing have dramatically improved directed evolution experiments. However, it is unclear how to best leverage these data to characterize protein fitness landscapes more completely and identify lead candidates. In this work, we develop a simple yet powerful framework to improve protein optimization by predicting continuous protein properties from simple directed evolution experiments using interpretable, linear machine learning models. Importantly, we find that these models, which use data from simple but imprecise experimental estimates of protein fitness, have predictive capabilities that approach more precise but expensive data. Evaluated across five diverse protein engineering tasks, continuous properties are consistently predicted from readily available deep sequencing data, demonstrating that protein fitness space can be reasonably well modeled by linear relationships among sequence mutations. To prospectively test the utility of this approach, we generated a library of stapled peptides and applied the framework to predict affinity and specificity from simple cell sorting data. We then coupled integer linear programming, a method to optimize protein fitness from linear weights, with mutation scores from machine learning to identify variants in unseen sequence space that have improved and co-optimal properties. This approach represents a versatile tool for improved analysis and identification of protein variants across many domains of protein engineering.
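
The workflow the abstract outlines (a linear model trained on binary sort labels, then a search over per-mutation weights) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the data are simulated, the 4-letter alphabet, the logistic-regression fit, and the function names `one_hot`, `fit_logistic`, and `best_variant` are all hypothetical choices made for illustration. The selection step is a stand-in for integer linear programming: with an additive model and a single mutation-budget constraint, taking the largest per-position gains gives the same optimum, whereas the paper's co-optimization of several properties would need a real solver.

```python
import numpy as np

ALPHABET = "ACDE"  # toy alphabet (assumption; real proteins use 20 residues)
IDX = {a: i for i, a in enumerate(ALPHABET)}

def one_hot(seqs):
    """Encode each sequence as a flat 0/1 vector of length len(seq) * len(ALPHABET)."""
    X = np.zeros((len(seqs), len(seqs[0]) * len(ALPHABET)))
    for n, s in enumerate(seqs):
        for p, a in enumerate(s):
            X[n, p * len(ALPHABET) + IDX[a]] = 1.0
    return X

def fit_logistic(X, y, lr=0.5, steps=2000, l2=1e-3):
    """Gradient descent on an L2-regularised logistic loss; returns weights and bias."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def best_variant(w, parent, max_mutations):
    """Pick the highest-scoring sequence within max_mutations of the parent.

    Exact for an additive model with one budget constraint; a stand-in for ILP.
    """
    W = w.reshape(len(parent), len(ALPHABET))
    gains = []
    for p, a in enumerate(parent):
        j = int(np.argmax(W[p]))
        gains.append((W[p, j] - W[p, IDX[a]], p, ALPHABET[j]))
    seq = list(parent)
    for gain, p, a in sorted(gains, reverse=True)[:max_mutations]:
        if gain > 0:  # only mutate where the model predicts an improvement
            seq[p] = a
    return "".join(seq)

# Simulated experiment (assumption): a hidden additive fitness, observed only
# as a binary "sorted into the gate or not" label.
rng = np.random.default_rng(0)
length, n_seqs = 6, 500
seqs = ["".join(rng.choice(list(ALPHABET), size=length)) for _ in range(n_seqs)]
X = one_hot(seqs)
w_true = rng.normal(size=X.shape[1])
fitness = X @ w_true + rng.normal(scale=0.5, size=n_seqs)  # continuous, unobserved
labels = (fitness > np.median(fitness)).astype(float)      # binary sort outcome

w, b = fit_logistic(X, labels)
predicted = X @ w + b  # continuous score recovered from binary labels
corr = np.corrcoef(predicted, fitness)[0, 1]

parent = seqs[0]
variant = best_variant(w, parent, max_mutations=2)
print(f"correlation with hidden fitness: {corr:.2f}")
print(f"parent {parent} -> proposed variant {variant}")
```

In this toy setting the learned weights play the role of per-mutation scores, and the linear score serves as a continuous proxy for the hidden property even though the model only ever saw binary labels.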

Funder

NIH

Publisher

Proceedings of the National Academy of Sciences

Link

https://pnas.org/doi/pdf/10.1073/pnas.2311726121

Cited by 1 article.

1. Training data composition determines machine learning generalization and biological rule discovery;2024-06-19