Affiliation:
1. Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, 16424, Indonesia
Abstract
The emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in late 2019 resulted in the COVID-19 pandemic, necessitating rapid and accurate detection of pathogens through protein sequence data. This study is aimed at developing an efficient classification model for coronavirus protein sequences using machine learning algorithms and feature selection techniques to aid in the early detection and prediction of novel viruses. We utilized a dataset comprising 2000 protein sequences, including 1000 SARS-CoV-2 sequences and 1000 non-SARS-CoV-2 sequences. Feature extraction provided 27 essential features representing the primary structural data, achieved through the Discere package. To optimize performance, we employed machine learning classification algorithms such as K-nearest neighbor (KNN), XGBoost, and Naïve Bayes, along with feature selection techniques like genetic algorithm (GA), LASSO, and support vector machine recursive feature elimination (SVM-RFE). The SVM-RFE+KNN model exhibited exceptional performance, achieving a classification accuracy of 99.30%, specificity of 99.52%, and sensitivity of 99.55%. These results demonstrate the model’s efficacy in accurately classifying coronavirus protein sequences. Our research successfully developed a robust classification model capable of early detection and prediction of protein sequences in SARS-CoV-2 and other coronaviruses. This advancement holds great promise in facilitating the development of targeted treatments and preventive strategies for combating future viral outbreaks.
Cited by
1 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献