Author:
King Samuel,Chen Xinyi E.,Ng Sarah W. S.,Rostin Kimia,Hahn Samuel V.,Roberts Tylo,Schwab Janella C.,Sekhon Parneet,Kagieva Madina,Reilly Taylor,Qi Ruo Chen,Salman Paarsa,Hong Ryan J.,Ma Eric J.,Hallam Steven J.
Abstract
The emergence of SARS-CoV-2 variants during the COVID-19 pandemic caused frequent global outbreaks that confounded public health efforts across many jurisdictions, highlighting the need for better understanding and prediction of viral evolution. Predictive models have been shown to support disease prevention efforts, such as with the seasonal influenza vaccine, but they require abundant data. For emerging viruses of concern, such models should ideally function with relatively sparse data typically encountered at the early stages of a viral outbreak. Conventional discrete approaches have proven difficult to develop due to the spurious and reversible nature of amino acid mutations and the overwhelming number of possible protein sequences adding computational complexity. We hypothesized that these challenges could be addressed by encoding discrete protein sequences into continuous numbers, effectively reducing the data size while enhancing the resolution of evolutionarily relevant differences. To this end, we developed a viral protein evolution prediction model (VPRE), which reduces amino acid sequences into continuous numbers by using an artificial neural network called a variational autoencoder (VAE) and models their most statistically likely evolutionary trajectories over time using Gaussian process (GP) regression. To demonstrate VPRE, we used a small amount of early SARS-CoV-2 spike protein sequences. We show that the VAE can be trained on a synthetic dataset based on this data. To recapitulate evolution along a phylogenetic path, we used only 104 spike protein sequences and trained the GP regression with the numerical variables to project evolution up to 5 months into the future. Our predictions contained novel variants and the most frequent prediction mapped primarily to a sequence that differed by only a single amino acid from the most reported spike protein within the prediction timeframe. Novel variants in the spike receptor binding domain (RBD) were capable of binding human angiotensin-converting enzyme 2 (ACE2) in silico, with comparable or better binding than previously resolved RBD-ACE2 complexes. Together, these results indicate the utility and tractability of combining deep learning and regression to model viral protein evolution with relatively sparse datasets, toward developing more effective medical interventions.