Abstract
Finding a vaccine or a specific antiviral treatment during a global pandemic of a viral disease (such as the ongoing COVID-19) requires rapid analysis, annotation, and evaluation of metagenomic libraries so that nucleotide sequences can be screened quickly and efficiently. Traditional sequence alignment methods are not suitable for this task, and there is a need for fast alignment-free techniques for sequence analysis. Information theory and data compression algorithms provide a rich set of mathematical and computational tools for capturing essential patterns in biological sequences. In this study, we investigate the use of distance measures based on compression complexity, namely Effort-to-Compress (ETC) and Lempel-Ziv (LZ) complexity, for analyzing genomic sequences. The proposed distance measures are used to successfully reproduce the phylogenetic trees for three datasets: a mammalian dataset consisting of eight species clusters; a set of coronaviruses belonging to group I, group II, group III, and SARS-CoV-1; and a set of coronaviruses comprising those that cause COVID-19 (SARS-CoV-2) and those that do not. Having demonstrated the usefulness of these compression-complexity measures, we employ them for the automatic classification of COVID-19-causing genome sequences using machine learning techniques. Four classifiers are used: two flavors of SVM (linear and quadratic), a linear discriminant, and a fine k-nearest neighbors classifier. On a dataset of 1001 coronavirus sequences (those causing COVID-19 and those not causing it), a classification accuracy of 98% is achieved, with a sensitivity of 95% and a specificity of 99.8%. This work could be extended to enable medical practitioners to automatically identify and characterize coronavirus strains and their rapidly growing mutants in a fast and efficient fashion.
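To make the idea of a compression-complexity distance concrete, the following is a minimal Python sketch of an alignment-free distance built on Lempel-Ziv (1976) complexity. It is an illustration under stated assumptions, not the measure used in the study: the abstract does not give the exact formula, so the normalized form in `lz_distance` is one common choice from the LZ-distance literature, and the function names `lz76_complexity` and `lz_distance` are hypothetical. ETC is not shown.

```python
def lz76_complexity(seq: str) -> int:
    """Count the phrases in the Lempel-Ziv (1976) exhaustive parsing of seq.

    Each phrase is the shortest substring starting at position i that has
    not already occurred in the text preceding its last character.
    """
    n = len(seq)
    i = 0          # start of the current phrase
    phrases = 0
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from earlier text
        # (the source may overlap the phrase, excluding its last symbol).
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def lz_distance(s: str, t: str) -> float:
    """Normalized LZ-based distance between two sequences.

    ASSUMPTION: this particular normalization is illustrative only and is
    not taken from the abstract above. It measures how many new phrases
    one sequence needs when parsed after the other.
    """
    cs, ct = lz76_complexity(s), lz76_complexity(t)
    if max(cs, ct) == 0:
        return 0.0  # both sequences empty
    extra_t_given_s = lz76_complexity(s + t) - cs
    extra_s_given_t = lz76_complexity(t + s) - ct
    return max(extra_t_given_s, extra_s_given_t) / max(cs, ct)


if __name__ == "__main__":
    # Toy nucleotide strings, invented purely for illustration.
    a = "ACGTACGTTGCAACGT"
    b = "ACGTACGATGCAACGT"
    print(lz76_complexity(a), lz_distance(a, b))
```

A pairwise matrix of such distances over a set of genomes could then be passed to a standard tree-building or classification routine; the choice of that routine is outside the scope of this sketch.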
Subject
General Physics and Astronomy