Affiliation:
1. Department of Mathematical Sciences, Tsinghua University, Beijing, China
2. Beijing Electronic Science and Technology Institute, Beijing, China
3. Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, China
Abstract
Background
Characterizing and comparing microbial sequences, including those of archaea, bacteria, viruses and fungi, is essential for understanding their evolutionary origins and population relationships. Most existing methods are limited by sequence length and lack generality. This study proposes a general characterization method and applies it to the classification and phylogeny of existing datasets.
Methods
We present a new alignment-free method to represent and compare biological sequences. By adding the covariance between each pair of nucleotides, the new 18-dimensional natural vector successfully describes 24,250 genomic sequences and 95,542 DNA barcode sequences. This numerical representation is used to study the classification and phylogenetic relationships of microbial sequences.
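As a rough illustration of how such an 18-dimensional vector might be assembled, the sketch below concatenates, for the four nucleotides, the counts, the mean positions, and the normalized second central moments (12 values), followed by one covariance term for each of the six nucleotide pairs. The abstract does not state the covariance formula, so the pairing of the i-th occurrences and the normalization by the product of the two counts are assumptions made here for illustration only, and the function name `natural_vector_with_covariance` is hypothetical.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"


def natural_vector_with_covariance(seq):
    """Return an 18-dimensional vector: 4 counts, 4 mean positions,
    4 normalized second central moments, and 6 pairwise covariance terms.

    The covariance term is an ASSUMED definition (not taken from the
    abstract): the i-th occurrences of two nucleotides are paired, and the
    sum of products of deviations is divided by the product of the counts.
    """
    seq = seq.upper()
    n = len(seq)
    # 1-based positions of each nucleotide in the sequence
    positions = {
        k: [i for i, ch in enumerate(seq, start=1) if ch == k]
        for k in NUCLEOTIDES
    }

    counts, means, moments = [], [], []
    for k in NUCLEOTIDES:
        pos = positions[k]
        n_k = len(pos)
        mu_k = sum(pos) / n_k if n_k else 0.0
        # Second central moment, scaled by n_k * n
        d2_k = sum((p - mu_k) ** 2 for p in pos) / (n_k * n) if n_k else 0.0
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)

    covariances = []
    for a, b in combinations(range(4), 2):
        pa = positions[NUCLEOTIDES[a]]
        pb = positions[NUCLEOTIDES[b]]
        m = min(len(pa), len(pb))  # number of paired occurrences
        if m == 0:
            covariances.append(0.0)
            continue
        total = sum(
            (pa[i] - means[a]) * (pb[i] - means[b]) for i in range(m)
        )
        covariances.append(total / (len(pa) * len(pb)))  # assumed scaling

    return counts + means + moments + covariances


vector = natural_vector_with_covariance("ACGTACGT")
print(len(vector))  # 18
```

Because the vector has a fixed length regardless of sequence length, sequences of very different sizes can be compared directly without alignment.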
Results
First, the classification results show that the six-dimensional covariance vector is necessary to characterize sequences. The 18-dimensional natural vector is then used to examine the similarity relationships between giant viruses and archaea, bacteria, and other viruses. The nearest-distance results indicate that giant viruses are closer to bacteria in the distribution of the four nucleotides. The phylogenetic relationships of three representative giant virus families, Mimiviridae, Pandoraviridae and Marseilleviridae, are analyzed. The trees show that ten Mimiviridae sequences cluster with Pandoraviridae, and that Mimiviridae is closer to the root of the tree than Marseilleviridae. The newly developed alignment-free method is fast to compute and provides an effective numerical representation for microbial sequences.
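The nearest-distance comparison mentioned above can be sketched as a simple nearest-neighbor lookup over feature vectors. The abstract does not specify the distance metric, so Euclidean distance is assumed here; the helper names and the two-dimensional toy vectors are hypothetical and stand in for real 18-dimensional natural vectors.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_label(query, reference):
    """Return the label of the reference vector closest to the query.

    `reference` is a list of (label, vector) pairs.
    """
    label, _ = min(reference, key=lambda item: euclidean(query, item[1]))
    return label


# Toy data for illustration only, not values from the study
reference = [
    ("bacteria", [0.0, 0.0]),
    ("archaea", [10.0, 10.0]),
]
print(nearest_label([1.0, 1.0], reference))  # bacteria
```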
Funder
National Natural Science Foundation of China (NSFC) Grant
Tsinghua University Spring Breeze Fund
Tsinghua University start-up fund
Tsinghua University Education Foundation fund
Subject
General Agricultural and Biological Sciences, General Biochemistry, Genetics and Molecular Biology, General Medicine, General Neuroscience