Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction

Published:2023-11-18 Issue:22 Volume:24 Page:16496
ISSN:1422-0067
Container-title:International Journal of Molecular Sciences
language:en
Short-container-title:IJMS

Author:

Qu Yang¹², Niu Zitong¹², Ding Qiaojiao¹², Zhao Taowa¹², Kong Tong², Bai Bing², Ma Jianwei², Zhao Yitian¹², Zheng Jianping¹²

Affiliation:

1. Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China

2. Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China

Abstract

Machine learning is increasingly used in protein engineering, and predicting the effects of protein mutations has attracted growing attention. To date, the best results have been achieved by methods based on protein language models, which are trained on large numbers of unlabeled protein sequences to capture the evolutionary rules hidden in those sequences and can therefore predict protein fitness from sequence. Although many such models and methods have been applied successfully in practical protein engineering, most studies have focused on building more complex language models that capture richer sequence features and on using those features for unsupervised fitness prediction. These models retain considerable untapped potential; for example, it is unclear whether integrating different models can further improve prediction accuracy. Furthermore, because the relationship between protein fitness and quantitative measures of specific functions is nonlinear, how to use large-scale models to predict mutational effects on quantifiable protein properties has yet to be explored thoroughly. In this study, we propose an ensemble learning approach for predicting the mutational effects of proteins that integrates sequence features extracted from multiple large protein language models with evolutionary coupling features extracted from homologous sequences, and we compare linear regression and deep learning models for mapping these features to quantifiable functional changes. We tested our approach on a dataset of 17 protein deep mutational scans and showed that the integrated approach, combined with linear regression, gives the models higher prediction accuracy and better generalization. We further illustrated the reliability of the integrated approach by examining differences in predictive performance across species and protein sequence lengths, and by visualizing the clustering of ensemble and non-ensemble features.
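
The feature-integration idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the feature blocks are random stand-ins for language-model embeddings and evolutionary coupling features, the target is synthetic, the names (`ridge_fit`, `spearman`, `run_demo`) are hypothetical, and a small ridge penalty is added to plain linear regression for numerical stability. The sketch concatenates the feature blocks per variant, fits a linear model, and compares held-out rank correlation for a single block against the concatenated ensemble.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form linear regression with an L2 penalty (alpha is an assumption)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean              # center so the intercept is unpenalized
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def spearman(a, b):
    """Spearman rank correlation; assumes continuous values without ties."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    n = 200                                       # number of synthetic variants
    # Hypothetical stand-ins for per-variant features from three sources.
    feats_model_a = rng.normal(size=(n, 8))       # e.g. embedding from one language model
    feats_model_b = rng.normal(size=(n, 6))       # e.g. embedding from a second model
    feats_coupling = rng.normal(size=(n, 4))      # e.g. evolutionary coupling features
    # Synthetic measured property that depends on all three sources.
    y = (feats_model_a[:, 0] + feats_model_b[:, 0] + feats_coupling[:, 0]
         + 0.1 * rng.normal(size=n))
    # Ensemble features: simple concatenation along the feature axis.
    ensemble = np.concatenate([feats_model_a, feats_model_b, feats_coupling], axis=1)
    train, test = slice(0, 150), slice(150, n)
    results = {}
    for name, X in {"single": feats_model_a, "ensemble": ensemble}.items():
        w, b = ridge_fit(X[train], y[train])
        results[name] = spearman(X[test] @ w + b, y[test])
    return ensemble.shape, results

if __name__ == "__main__":
    shape, results = run_demo()
    print("ensemble feature shape:", shape)
    print("held-out Spearman:", results)
```

Because the synthetic target is built from all three blocks, the concatenated features score higher by construction; the sketch shows only the mechanics of combining features and evaluating by rank correlation, not the paper's results.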

Funder

National Key R&D Program of China

Ningbo Institute of Materials Technology and Engineering (NIMTE), CAS

3315 Innovation Team Foundation of Ningbo

Publisher

MDPI AG

Subject

Inorganic Chemistry, Organic Chemistry, Physical and Theoretical Chemistry, Computer Science Applications, Spectroscopy, Molecular Biology, General Medicine, Catalysis

Link

https://www.mdpi.com/1422-0067/24/22/16496/pdf

References: 69 articles.

1. Total human body protein synthesis in relation to protein requirements at various ages;Young;Nature,1975

2. The structural role of the carrier protein–active controller or passive carrier;Crosby;Nat. Prod. Rep.,2012

3. Tailoring enzyme activity and stability using polymer-based protein engineering;Cummings;Biomaterials,2013

4. Using machine learning to predict the effects and consequences of mutations in proteins;Diaz;Curr. Opin. Struct. Biol.,2023

5. Exploring protein fitness landscapes by directed evolution;Romero;Nat. Rev. Mol. Cell Biol.,2009

Cited by 2 articles.

1. Advancing virulence factor prediction using protein language models;2024-07-29

2. Empowering Protein Engineering through Recombination of Beneficial Substitutions;Chemistry – A European Journal;2024-02-22