Abstract
AbstractAn important problem in metagenomic data analysis is to identify the source organism, or at least taxon, of each sequence. Most methods tackle this problem in two steps by using an alignment-free approach: first the DNA sequences are represented as points of a real n-dimensional space via a mapping function then either clustering or classification algorithms are applied. Those mapping functions require to be genomic signatures: the dissimilarity between the mapped points must reflect the degree of phylogenetic similarity of the source species. Designing good signatures for metagenomics can be challenging due to the special characteristics of metagenomic sequences; most of the existing signatures were not designed accordingly and they were tested only on error-free sequences sampled from a few dozens of species.In this work we analyze comparatively the goodness of existing and novel signatures based on tetranu-cleotide frequencies via statistical models and computational experiments; we also study how they are affected by the generalized Chargaff’s second parity rule (GCSPR), which states that in a given sequence longer than 50kbp, inverse oligonucleotides are approximately equally frequent. We analyze 38 million sequences of 150 bp-1,000 bp with 1% base-calling error, sampled from 1,284 microbes. Our models indicate that GCSPR reduces strand-dependence of signatures, that is, their values are less affected by the source strand; GCSPR is further exploited by some signatures to reduce the intra-species dispersion. Two novel signatures stand out both in the models and in the experiments: the combination signature and the operation signature. The former achieves strand-independence without grouping oligonucleotides; this could be valuable for alignment-free sequence comparison methods when distinguishing inverse oligonucleotides matters. Operation signature sums the frequencies of reverse, complement, and inverse tetranucleotides; having 72 features it reduces the computational intensity of the analysis.
Publisher
Cold Spring Harbor Laboratory
Cited by
2 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献
1. Nucleotide tetramers TCGA and CTAG: viral DNA and the genetic code (hypothesis);Journal of microbiology, epidemiology and immunobiology;2022-09-25
2. Tetranucleotide Profile of Herpesvirus DNA;Journal of microbiology, epidemiology and immunobiology;2020-06-25