Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry

Published: 2023-03-20

Author:

Kulikova Anastasiya V., Diaz Daniel J., Chen Tianlong, Cole T. Jeffrey, Ellington Andrew D., Wilke Claus O.

Abstract

Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences, whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues, whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
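The idea of a combined model that takes individual model predictions as input can be illustrated with a minimal sketch. This is not the authors' implementation: the four "models" below are hypothetical stand-ins that emit synthetic per-residue probabilities, the split of amino acid classes into two groups is arbitrary, and the combiner is assumed to be a plain softmax regression on the concatenated outputs. The abstract does not specify any of these details.

```python
import numpy as np

N_CLASSES = 20  # one class per standard amino acid


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def simulate_model(labels, strong_classes, rng):
    """Hypothetical stand-in for a trained model (assumption, not real data):
    probabilities are informative only for residues in `strong_classes`."""
    logits = rng.normal(size=(labels.size, N_CLASSES))
    informed = np.isin(labels, strong_classes)
    logits[informed, labels[informed]] += 6.0  # boost the true class
    return softmax(logits)


def fit_combiner(features, labels, lr=0.5, steps=500):
    """Softmax regression on concatenated model outputs (full-batch gradient descent)."""
    onehot = np.eye(N_CLASSES)[labels]
    weights = np.zeros((features.shape[1], N_CLASSES))
    for _ in range(steps):
        probs = softmax(features @ weights)
        weights -= lr * features.T @ (probs - onehot) / len(labels)
    return weights


def accuracy(probs, labels):
    return float((probs.argmax(axis=1) == labels).mean())


rng = np.random.default_rng(0)
labels = rng.integers(0, N_CLASSES, size=4000)
group_a, group_b = np.arange(0, 10), np.arange(10, 20)  # arbitrary class split

# Two "sequence-like" and two "structure-like" stand-ins with complementary strengths.
outputs = [simulate_model(labels, g, rng) for g in (group_a, group_a, group_b, group_b)]
features = np.hstack(outputs)  # shape: (n_residues, 4 * N_CLASSES)

train, test = slice(0, 3000), slice(3000, None)
weights = fit_combiner(features[train], labels[train])

individual_acc = [accuracy(p[test], labels[test]) for p in outputs]
combined_acc = accuracy(softmax(features[test] @ weights), labels[test])
print("individual:", np.round(individual_acc, 3), "combined:", round(combined_acc, 3))
```

In this toy setting each stand-in is accurate on only half of the classes, so the combiner gains by learning which model to trust for which amino acid. The size of the gain here is a property of the synthetic data and says nothing about the magnitude reported in the paper.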

Publisher

Cold Spring Harbor Laboratory

Cited by 3 articles.

1. Stability Oracle: a structure-based graph-transformer framework for identifying stabilizing mutations; Nature Communications; 2024-07-23

2. Biophysical cartography of the native and human-engineered antibody landscapes quantifies the plasticity of antibody developability; 2023-10-30

3. Stability Oracle: A Structure-Based Graph-Transformer for Identifying Stabilizing Mutations; 2023-05-15