A new profiling approach for DNA sequences based on the nucleotides' physicochemical features for accurate analysis of SARS-CoV-2 genomes

Published: 2023-05-18 Issue: 1 Volume: 24
ISSN:1471-2164
Container-title:BMC Genomics
language:en
Short-container-title:BMC Genomics

Author:

Akbari Rokn Abadi Saeedeh, Mohammadi Amirhossein, Koohi Somayyeh

Abstract

Background: The prevalence of COVID-19 in recent years, and its widespread impact on mortality and on many aspects of life around the world, has made it important to study the disease and the virus that causes it. However, the very long sequences of this virus increase the processing time, computational complexity, and memory consumption of the tools available for comparing and analyzing them.

Results: We present a new encoding method, named PC-mer, based on k-mers and the physicochemical properties of nucleotides. This method reduces the size of the encoded data by a factor of around 2^k compared to the classical k-mer based profiling method. Using PC-mer, we designed two tools: 1) a machine-learning-based classification tool for coronavirus family members that can receive input sequences from the NCBI database, and 2) an alignment-free computational comparison tool for calculating dissimilarity scores between coronaviruses at the genus and species levels.

Conclusions: PC-mer achieves 100% classification accuracy even with very simple machine-learning algorithms. Taking dynamic-programming-based pairwise alignment as the ground-truth approach, we achieved a degree of convergence of more than 98% for coronavirus genus-level sequences and 93% for SARS-CoV-2 sequences using PC-mer in the alignment-free classification method. This performance suggests that PC-mer can replace alignment-based approaches in certain sequence analysis applications that rely on similarity/dissimilarity scores, such as sequence search, sequence comparison, and those phylogenetic analysis methods that are based on sequence comparison.
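The abstract does not spell out how the encoding works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each nucleotide is mapped to one of two classes under three common physicochemical groupings (purine/pyrimidine, amino/keto, weak/strong hydrogen bonding) and that k-mers are then counted over the resulting binary strings. The function names, the choice of groupings, and the Euclidean dissimilarity are all illustrative assumptions. Under these assumptions a classical profile has 4^k entries, while the reduced profile has 3·2^k.

```python
import math
from collections import Counter
from itertools import product

# Assumed two-class groupings of nucleotides (illustrative, not from the source).
GROUPINGS = {
    "purine_pyrimidine": {"A": "0", "G": "0", "C": "1", "T": "1"},
    "amino_keto": {"A": "0", "C": "0", "G": "1", "T": "1"},
    "weak_strong": {"A": "0", "T": "0", "C": "1", "G": "1"},
}


def kmer_profile(seq, k, alphabet):
    """Normalized k-mer frequency vector over all len(alphabet)**k words."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    # Windows containing symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def reduced_profile(seq, k):
    """Concatenate binary k-mer profiles, one per assumed grouping."""
    vec = []
    for mapping in GROUPINGS.values():
        # Simplification: ambiguous bases are dropped before windowing.
        binary = "".join(mapping[c] for c in seq if c in mapping)
        vec.extend(kmer_profile(binary, k, "01"))
    return vec


def euclidean(u, v):
    """Alignment-free dissimilarity between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


k = 4
a = "ATGCGTACGTTAGC"
b = "ATGCGTACGATAGC"
print(len(kmer_profile(a, k, "ACGT")))  # 4**4 = 256 entries
print(len(reduced_profile(a, k)))       # 3 * 2**4 = 48 entries
print(euclidean(reduced_profile(a, k), reduced_profile(b, k)))
```

The two toy sequences differ in a single base, so the printed dissimilarity is small but nonzero; real use would fetch full genomes and pick k empirically.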

Publisher

Springer Science and Business Media LLC

Subject

Genetics,Biotechnology

Link

https://link.springer.com/content/pdf/10.1186/s12864-023-09373-7.pdf

References: 21 articles.

1. Arslan H, Arslan H. A new COVID-19 detection method from human genome sequences using CpG island features and KNN classifier. Eng Sci Technol an Int J. 2021;24(4):839–47. https://doi.org/10.1016/j.jestch.2020.12.026.

2. Dlamini GS, et al. Classification of COVID-19 and other pathogenic sequences: A dinucleotide frequency and machine learning approach. IEEE Access. 2020;8:195263–73. https://doi.org/10.1109/ACCESS.2020.3031387.

3. Randhawa GS, Soltysiak MPM, El Roz H, de Souza CPE, Hill KA, Kari L. Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study. PLoS One. 2020;15(4):e0232391. https://doi.org/10.1371/journal.pone.0232391.

4. Whata A, Chimedza C. Deep Learning for SARS COV-2 Genome Sequences. IEEE Access. 2021;9:59597–611. https://doi.org/10.1109/ACCESS.2021.3073728.

5. Li X, et al. Evolutionary history, potential intermediate animal host, and cross-species analyses of SARS-CoV-2. J Med Virol. 2020;92(6). https://doi.org/10.1002/jmv.25731.

Cited by 8 articles.

1. PC-mer: An Ultra-fast memory-efficient tool for metagenomics profiling and classification;PLOS ONE;2024-08-01

2. COMPUTATIONAL TOOLS FOR THE DNA TEXT COMPLEXITY ESTIMATES FOR MICROBIAL GENOMES STRUCTURE ANALYSIS;Russian Journal of Biological Physics and Chemistry;2024-06-06

3. Country-Based COVID-19 DNA Sequence Classification in Relation with International Travel Policy;Applied Sciences;2024-02-26

4. Application of genomic signal processing as a tool for high-performance classification of SARS-CoV-2 variants: a machine learning-based approach;Soft Computing;2024-01-22

5. Efficient Tf-Idf Method for Alignment-Free DNA Sequence Similarity Analysis;2024