Protein Fitness Prediction is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods

Published: 2023-02-10

Author:

Mardikoraem Mehrsa, Woldring Daniel

Abstract

Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed our ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges for training discriminative models and evaluating their performance are handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations. Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling methods and protein representations to improve model performance in two different datasets with binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding, physicochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). Performance is examined with respect to protein fitness, length, data size, and sampling methods. In addition, an ensemble of representation methods is generated to discover the contribution of distinct representations to the final prediction score. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling while encoding sequences with One-Hot, UniRep, and ESM representations. In addition, ensemble learning increased the predictive performance of the affinity-based dataset by 4% compared to the best single encoding candidate (F1-score = 97%), while ESM alone was rigorous enough in stability prediction (F1-score = 92%).
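To make two of the ideas named in the abstract concrete, the following is a minimal sketch of One-Hot encoding followed by SMOTE-style oversampling of the minority class. It is not the authors' implementation: the function names, the toy sequences, and the choice of `k` are all hypothetical, and the oversampler is a simplified stand-in for SMOTE written with the standard library only.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Flatten a protein sequence into a 0/1 vector of length 20 * len(sequence)."""
    vector = [0.0] * (len(sequence) * len(AMINO_ACIDS))
    for position, residue in enumerate(sequence):
        vector[position * len(AMINO_ACIDS) + AA_INDEX[residue]] = 1.0
    return vector


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: squared_distance(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 3 "high-fitness" and 8 "non-functional" sequences.
high_fitness = ["ACDE", "ACDF", "ACEE"]
non_functional = ["WWWW", "YYYY", "KKKK", "LLLL", "MMMM", "PPPP", "GGGG", "HHHH"]

X_minority = [one_hot_encode(s) for s in high_fitness]
X_majority = [one_hot_encode(s) for s in non_functional]

# Add synthetic minority points until both classes are the same size.
X_synthetic = smote_like_oversample(
    X_minority, n_new=len(X_majority) - len(X_minority)
)
X_balanced_minority = X_minority + X_synthetic
```

Note that the interpolated vectors are fractional and no longer correspond to real sequences; they exist only in feature space to rebalance training. Undersampling, the alternative compared in the abstract, would instead discard majority-class examples until the classes match.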

Publisher

Cold Spring Harbor Laboratory

Cited by 4 articles.

1. A Stacking Machine Learning Method for IL-10-Induced Peptide Sequence Recognition Based on Unified Deep Representation Learning; Applied Sciences; 2023-08-17

2. Protein engineering via sequence-performance mapping; Cell Systems; 2023-08

3. DeCOIL: Optimization of Degenerate Codon Libraries for Machine Learning-Assisted Protein Engineering; ACS Synthetic Biology; 2023-07-31

4. DeCOIL: Optimization of Degenerate Codon Libraries for Machine Learning-Assisted Protein Engineering; 2023-05-11