Unveiling the Robustness of Machine Learning Models in Classifying COVID-19 Spike Sequences

Published: 2023-08-24

Author:

Ali Sarwan, Chen Pin-Yu, Patterson Murray

Abstract

In the midst of the global COVID-19 pandemic, a wealth of data has become available to researchers, presenting a unique opportunity to investigate the behavior of the virus. This research aims to facilitate the design of efficient vaccinations and proactive measures to prevent future pandemics through the utilization of machine learning (ML) models for decision-making processes. Consequently, ensuring the reliability of ML predictions in these critical and rapidly evolving scenarios is of utmost importance. Notably, studies focusing on the genomic sequences of individuals infected with the coronavirus have revealed that the majority of variations occur within a specific region known as the spike (or S) protein. Previous research has explored the analysis of spike proteins using various ML techniques, including classification and clustering of variants. However, it is imperative to acknowledge the possibility of errors in spike proteins, which could lead to misleading outcomes and misguide decision-making authorities. Hence, a comprehensive examination of the robustness of ML and deep learning models in classifying spike sequences is essential. In this paper, we propose a framework for evaluating and benchmarking the robustness of diverse ML methods in spike sequence classification. Through extensive evaluation of a wide range of ML algorithms, ranging from classical methods like naive Bayes and logistic regression to advanced approaches such as deep neural networks, our research demonstrates that utilizing k-mers for creating the feature vector representation of spike proteins is more effective than traditional one-hot encoding-based embedding methods. Additionally, our findings indicate that deep neural networks exhibit superior accuracy and robustness compared to non-deep-learning baselines.
To the best of our knowledge, this study is the first to benchmark the accuracy and robustness of machine-learning classification models against various types of random corruptions in COVID-19 spike protein sequences. The benchmarking framework established in this research holds the potential to assist future researchers in gaining a deeper understanding of the behavior of the coronavirus, enabling the implementation of proactive measures and the prevention of similar pandemics in the future.
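
The two representations contrasted in the abstract, and the idea of a random corruption, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 20-letter amino-acid alphabet, the choice of k = 3, the toy sequence, and the substitution-only corruption model are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import product

# Assumed alphabet: the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(seq, k=3, alphabet=AMINO_ACIDS):
    """Count each length-k substring of seq, reported in a fixed k-mer order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def one_hot_vector(seq, alphabet=AMINO_ACIDS):
    """Concatenate one indicator block per position (length depends on len(seq))."""
    index = {a: i for i, a in enumerate(alphabet)}
    vec = [0] * (len(seq) * len(alphabet))
    for pos, residue in enumerate(seq):
        if residue in index:  # symbols outside the alphabet stay all-zero
            vec[pos * len(alphabet) + index[residue]] = 1
    return vec


def corrupt(seq, rate, rng, alphabet=AMINO_ACIDS):
    """Hypothetical corruption: replace each residue with a random one w.p. rate."""
    return "".join(
        rng.choice(alphabet) if rng.random() < rate else c for c in seq
    )


if __name__ == "__main__":
    rng = random.Random(0)
    toy = "MFVFLVLLPLVSSQ"  # short toy string, not a full spike sequence
    noisy = corrupt(toy, rate=0.1, rng=rng)
    print(len(kmer_vector(toy)), len(one_hot_vector(toy)))
    print(toy, noisy)
```

In this sketch the k-mer vector has a fixed length (20^k) regardless of sequence length, whereas the one-hot vector grows with the sequence and is tied to exact positions. A robustness benchmark in this spirit would train a classifier on clean vectors and compare its accuracy on vectors built from corrupted sequences at several corruption rates.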

Publisher

Cold Spring Harbor Laboratory
