Predicting binding affinities of emerging variants of SARS-CoV-2 using spike protein sequencing data: observations, caveats and recommendations-Reference-Cited by-同舟云学术

Predicting binding affinities of emerging variants of SARS-CoV-2 using spike protein sequencing data: observations, caveats and recommendations

Published:2022-04-18 Issue:3 Volume:23 Page:
ISSN:1467-5463
Container-title:Briefings in Bioinformatics
language:en
Short-container-title:

Author:

Zhang Ruibo¹,Ghosh Souparno²,Pal Ranadip¹

Affiliation:

1. Department of Electrical and Computer Engineering, Texas Tech University, TX, USA

2. Department of Statistics, University of Nebraska - Lincoln, NB, USA

Abstract

Abstract Predicting protein properties from amino acid sequences is an important problem in biology and pharmacology. Protein–protein interactions among SARS-CoV-2 spike protein, human receptors and antibodies are key determinants of the potency of this virus and its ability to evade the human immune response. As a rapidly evolving virus, SARS-CoV-2 has already developed into many variants with considerable variation in virulence among these variants. Utilizing the proteomic data of SARS-CoV-2 to predict its viral characteristics will, therefore, greatly aid in disease control and prevention. In this paper, we review and compare recent successful prediction methods based on long short-term memory (LSTM), transformer, convolutional neural network (CNN) and a similarity-based topological regression (TR) model and offer recommendations about appropriate predictive methodology depending on the similarity between training and test datasets. We compare the effectiveness of these models in predicting the binding affinity and expression of SARS-CoV-2 spike protein sequences. We also explore how effective these predictive methods are when trained on laboratory-created data and are tasked with predicting the binding affinity of the in-the-wild SARS-CoV-2 spike protein sequences obtained from the GISAID datasets. We observe that TR is a better method when the sample size is small and test protein sequences are sufficiently similar to the training sequence. However, when the training sample size is sufficiently large and prediction requires extrapolation, LSTM embedding and CNN-based predictive model show superior performance.

Funder

National Science Foundation

Publisher

Oxford University Press (OUP)

Subject

Molecular Biology,Information Systems

Link

https://academic.oup.com/bib/article-pdf/23/3/bbac128/43745858/bbac128.pdf

Reference41 articles.

1. Structural basis of receptor recognition by sars-cov-2;Shang;Nature,2020

2. Mapping neutralizing and immunodominant sites on the sars-cov-2 spike receptor-binding domain by structure-guided high-resolution serology;Piccoli;Cell,2020

3. New mutations raise specter of ‘immune escape’;Kupferschmidt;Science,2021