The MetaGens algorithm for metagenomic database lossy compression and subject alignment

Published:2023-01-01 Volume:2023
ISSN:1758-0463
Container-title:Database
language:en

Author:

Cervi Gustavo Henrique¹, Flores Cecilia Dias¹, Thompson Claudia Elizabeth¹

Affiliation:

1. Graduate Program in Health Sciences, Universidade Federal de Ciências da Saúde de Porto Alegre (UFCSPA), Rua Sarmento Leite, 245 - Centro Histórico, Porto Alegre, RS 90050-170, Brazil

Abstract

The advancement of genetic sequencing techniques has led to the production of a large volume of data. The extraction of genetic material from a sample is one of the early steps of a metagenomic study. As these processes evolved, analysis of the sequenced data allowed the discovery of etiological agents and, as a corollary, the diagnosis of infections. One of the biggest challenges of the technique is the huge volume of data generated with each new technology developed. This work introduces an algorithm that may reduce the data volume, allowing faster DNA matching against reference databases. Using techniques such as lossy compression and a substitution matrix, it is possible to match nucleotide sequences without losing the subject. The lossy compression exploits the nature of DNA mutations, insertions and deletions, and the possibility that different sequences represent the same subject. The algorithm can reduce the overall size of the database to 15% of the original size; depending on the parameters, it may reduce it to as little as 5% of the original size. Although the approach is the same as on other platforms, the matching algorithm is more sensitive because it ignores transitions and transversions, resulting in a faster way to obtain diagnostic results. In the first experiment, the algorithm ran 10 times faster than BLAST while maintaining high sensitivity. This performance gain can be extended by combining it with other techniques already used in other studies, such as hash tables. Database URL: https://github.com/ghc4/metagens
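
The abstract does not spell out the compression scheme, so the following is only a minimal sketch of the general idea of lossy matching, not the MetaGens implementation. It assumes one well-known form of alphabet reduction: collapsing purines (A, G) to `R` and pyrimidines (C, T) to `Y`, so that sequences differing only by transitions compare as equal. The names `REDUCE`, `reduce_alphabet` and `same_subject` are hypothetical, and this sketch ignores transitions only; it does not handle transversions, insertions or deletions.

```python
# Hypothetical illustration only; NOT the MetaGens implementation.
# Assumption: "lossy" is modelled as collapsing A/G -> R (purines)
# and C/T -> Y (pyrimidines), which hides transition mutations.
REDUCE = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y"})


def reduce_alphabet(seq: str) -> str:
    """Collapse a DNA string to a two-letter purine/pyrimidine alphabet."""
    return seq.upper().translate(REDUCE)


def same_subject(query: str, reference: str) -> bool:
    """Return True if the reduced query occurs in the reduced reference."""
    return reduce_alphabet(query) in reduce_alphabet(reference)


reference = "ATGGCGTACGTTAGC"
# Differs from reference[3:9] ("GCGTAC") by two transitions: G->A and C->T.
query = "GCATAT"

exact_hit = query in reference               # exact string search
lossy_hit = same_subject(query, reference)   # search in the reduced alphabet
```

In this toy case the exact search misses the query while the reduced-alphabet search finds it. A two-letter alphabet also needs only one bit per base, which is one way a lossy encoding can shrink a database, though the size figures quoted in the abstract come from the authors' own method and parameters.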

Funder

Conselho Nacional de Desenvolvimento Científico e Tecnológico

Publisher

Oxford University Press (OUP)

Subject

General Agricultural and Biological Sciences; General Biochemistry, Genetics and Molecular Biology; Information Systems

Link

https://academic.oup.com/database/article-pdf/doi/10.1093/database/baad053/51472423/baad053.pdf
