An Analysis of Protein Language Model Embeddings for Fold Prediction

Published: 2022-02-10

Author:

Villegas-Morcillo Amelia, Gomez Angel M., Sanchez Victoria

Abstract

The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings, mainly using evolutionary information in the form of multiple sequence alignments (MSAs) as the input source. In contrast, protein language models (LMs) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction that uses pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are trained in a supervised manner with fold labels. In particular, we compare the performance of six protein-LM embeddings: the LSTM-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT, and ProtT5; as well as three neural networks: Multi-Layer Perceptron (MLP), ResCNN-BGRU (RBG), and Light-Attention (LAT). We separately evaluate the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at the amino-acid level, with the RBG and LAT fine-tuning models performs remarkably well in both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. Together, these results suggest that moving from traditional protein representations to protein-LM embeddings is a very promising approach to protein fold-related tasks.

Publisher

Cold Spring Harbor Laboratory
