Abstract
Background
DNA sequences carry vital information about organisms and viruses. The ability to analyze long DNA sequences with methods that run on conventional computer hardware has proven invaluable, especially for responding in a timely way to global pandemics such as COVID-19.
Objectives
This study introduces a new representation that encodes DNA sequences as unit-vector transitions in a 2D space, applied to sequences extracted from the 2019 Novel Coronavirus Resource (2019nCoVR) repository. The main objective is to show the potential of this method to support virus classification using minimal hardware resources. It also aims to demonstrate the feasibility of the technique through dimensionality reduction and the application of machine learning models.
Methods
DNA sequences were transformed into two-nucleotide base transitions (referred to as ‘transitions’). Each transition was represented as a corresponding unit vector in 2D space, so that a DNA sequence could be represented efficiently as a series of dynamic transitions. After a moving average and resampling were applied, the transitions underwent dimensionality reduction with techniques such as Principal Component Analysis (PCA). Conventional machine learning approaches were then applied to the reduced data, producing a multiclass classification among six virus species of the Coronaviridae family, including SARS-CoV-2.
Results and Discussions
The implemented method produced a faithful representation of the sequences, allowing six types of viruses from the Coronaviridae family to be told apart visually through direct plotting. Evaluated with stratified cross-validation, the model reached accuracy, sensitivity, specificity and F1-score values equal to or greater than 99%. This performance is comparable, if not superior, to that of the computationally intensive methods discussed in the state of the art.
Conclusions
The proposed coding method is a computationally efficient and promising addition to contemporary DNA sequence coding techniques. Its merits lie in its simplicity, visual interpretability and ease of implementation, making it a potential complement to existing strategies in the field.
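The pipeline summarized under Methods (transition encoding, moving average, resampling, PCA) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the 16 transitions are assigned to directions, whether pairs overlap, or which window and resampling lengths were used. The equally spaced angle mapping, the overlapping pairs, the skipping of ambiguous bases, and the function names `encode_transitions`, `moving_average`, `resample`, `featurize` and `pca_project` are all assumptions made for illustration.

```python
import numpy as np

BASES = "ACGT"
# ASSUMPTION: each of the 16 ordered base pairs gets one of 16 equally
# spaced angles on the unit circle. The source does not state the mapping.
TRANSITIONS = [a + b for a in BASES for b in BASES]
ANGLES = {t: 2 * np.pi * i / len(TRANSITIONS) for i, t in enumerate(TRANSITIONS)}


def encode_transitions(seq):
    """Map each overlapping base pair to a 2D unit vector.

    ASSUMPTION: pairs overlap (step of one base) and pairs containing
    ambiguous symbols such as N are skipped.
    """
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    theta = np.array([ANGLES[p] for p in pairs if p in ANGLES])
    return np.column_stack((np.cos(theta), np.sin(theta)))


def moving_average(vectors, window):
    """Smooth the x and y components with a simple box filter."""
    kernel = np.ones(window) / window
    return np.column_stack(
        [np.convolve(vectors[:, k], kernel, mode="valid") for k in range(2)]
    )


def resample(vectors, n_points):
    """Linearly interpolate to a fixed number of points per sequence."""
    old = np.linspace(0.0, 1.0, len(vectors))
    new = np.linspace(0.0, 1.0, n_points)
    return np.column_stack(
        [np.interp(new, old, vectors[:, k]) for k in range(2)]
    )


def featurize(seq, window=3, n_points=8):
    """Encode, smooth and resample one sequence into a flat feature vector."""
    smoothed = moving_average(encode_transitions(seq), window)
    return resample(smoothed, n_points).ravel()


def pca_project(X, n_components):
    """Project rows of X onto the leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


# Toy sequences for illustration only, not data from 2019nCoVR.
toy = ["ACGTACGTTGCA", "AAAACCCCGGGG", "TTGACCAGTACG"]
X = np.stack([featurize(s) for s in toy])
Z = pca_project(X, n_components=2)
```

Resampling to a fixed length gives every sequence a feature vector of the same size regardless of genome length, which is what lets PCA and a conventional classifier operate on the stacked matrix. Real genomes would call for much larger `window` and `n_points` values than the toy defaults used here.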
Publisher
Cold Spring Harbor Laboratory