Compression-Complexity Measures for Analysis and Classification of Coronaviruses

Published: 2022-12-31 Issue: 1 Volume: 25 Page: 81
ISSN: 1099-4300
Container-title: Entropy
Language: en
Short-container-title: Entropy

Author:

Munagala Naga Venkata Trinath Sai, Amanchi Prem Kumar, Balasubramanian Karthi, Panicker Athira, Nagaraj Nithin

Abstract

Finding a vaccine or specific antiviral treatment for a global viral pandemic (such as the ongoing COVID-19 pandemic) requires rapid analysis, annotation and evaluation of metagenomic libraries to enable quick and efficient screening of nucleotide sequences. Traditional sequence alignment methods are not suitable, and there is a need for fast alignment-free techniques for sequence analysis. Information theory and data compression algorithms provide a rich set of mathematical and computational tools to capture essential patterns in biological sequences. In this study, we investigate the use of distance measures based on compression-complexity (Effort-to-Compress, or ETC, and Lempel-Ziv, or LZ, complexity) for analyzing genomic sequences. The proposed distance measure is used to successfully reproduce the phylogenetic trees for three datasets: a mammalian dataset consisting of eight species clusters; a set of coronaviruses belonging to group I, group II, group III, and SARS-CoV-1; and a set of coronaviruses comprising those causing COVID-19 (SARS-CoV-2) and those not causing COVID-19. Having demonstrated the usefulness of these compression-complexity measures, we employ them for the automatic classification of COVID-19-causing genome sequences using machine learning techniques. Two flavors of SVM (linear and quadratic), along with linear discriminant and fine k-nearest-neighbors (KNN) classifiers, are used for classification. Using a dataset comprising 1001 coronavirus sequences (causing COVID-19 and not causing COVID-19), a classification accuracy of 98% is achieved, with a sensitivity of 95% and a specificity of 99.8%. This work could be extended to enable medical practitioners to automatically identify and characterize coronavirus strains and their rapidly growing mutants in a fast and efficient fashion.
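
As a rough illustration of the two complexity measures named in the abstract, the Python sketch below computes a Lempel-Ziv (1976) phrase count and an Effort-to-Compress count for short symbol strings, then builds one possible distance from the LZ count. It is not taken from the paper. The function names are hypothetical, the ETC tie-breaking and pair-counting rules are simplifying assumptions, and the distance formula is a commonly used LZ-based form (maximum extra phrases under concatenation, normalized by the larger complexity) that may differ from the distance the authors propose.

```python
from collections import Counter


def lz76_complexity(seq):
    """Number of phrases in the Lempel-Ziv (1976) exhaustive parsing of a str."""
    i, phrases, n = 0, 0, len(seq)
    while i < n:
        length = 1
        # Grow the candidate phrase while it already occurs in the text
        # that precedes its last symbol (overlapping copies are allowed).
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def etc_complexity(seq):
    """Effort-to-Compress: number of pair-substitution passes needed to reach
    a constant (or length-1) sequence. Assumptions: overlapping pair counts,
    ties broken by first occurrence, left-to-right replacement."""
    symbols = list(seq)
    steps = 0
    while len(symbols) > 1 and len(set(symbols)) > 1:
        pair_counts = Counter(zip(symbols, symbols[1:]))
        target = pair_counts.most_common(1)[0][0]
        fresh = ("etc", steps)  # tuple symbol, cannot collide with str input
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == target:
                out.append(fresh)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
        steps += 1
    return steps


def lz_distance(s, q):
    """Assumed LZ-based distance between two non-empty strings:
    max(c(sq) - c(s), c(qs) - c(q)) / max(c(s), c(q))."""
    c_s, c_q = lz76_complexity(s), lz76_complexity(q)
    gain = max(lz76_complexity(s + q) - c_s, lz76_complexity(q + s) - c_q)
    return gain / max(c_s, c_q)


print(lz76_complexity("0001101001000101"))  # 6 phrases: 0|001|10|100|1000|101
print(etc_complexity("12121112"))           # 5 substitution passes
print(lz_distance("ACGT" * 5, "ACGT" * 5))  # 0.0 for identical inputs
```

A pairwise matrix of such distances over a set of genomes could then be passed to a standard tree-building or classification routine. The quadratic-time substring search here is only practical for short strings; full genome sequences would need a faster parser.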

Publisher

MDPI AG

Subject

General Physics and Astronomy

Link

https://www.mdpi.com/1099-4300/25/1/81/pdf

References (47 articles)

1. Lebatteux (2019). Toward an alignment-free method for feature extraction and accurate classification of viral sequences. J. Comput. Biol.

2. Zhao (2019). An alignment-free measure based on physicochemical properties of amino acids for protein sequence comparison. Comput. Biol. Chem.

3. Lesk, A. (2012). Introduction to genomics, Oxford University Press.

4. Pearson (2013). An introduction to sequence similarity (“homology”) searching. Curr. Protoc. Bioinform.

5. Gupta, M.K., Niyogi, R., and Misra, M. (2013, January 8–10). A framework for alignment-free methods to perform similarity analysis of biological sequence. Proceedings of the Sixth International Conference on Contemporary Computing (IC3), Noida, India.

Cited by 2 articles.

1. Interpretation of ETC in the Context of Biomedical Signal Analysis. 2023 IEEE 11th Region 10 Humanitarian Technology Conference (R10-HTC), 2023-10-16.

2. Bioinformatics tools for the sequence complexity estimates. Biophysical Reviews, 2023-09-15.