Abstract
Information-theoretic approaches are ubiquitous and effective in a wide variety of bioinformatics applications. In comparative genomics, alignment-free methods based on short DNA words, or k-mers, are particularly powerful. We evaluated the utility of varying k-mer lengths for genome comparisons by analyzing their sequence space coverage of 5805 genomes in the KEGG GENOME database. In subsequent analyses of four k-mer lengths spanning the relevant range (11, 21, 31, 41), hierarchical clustering of 1634 genus-level representative genomes using pairwise 21- and 31-mer Jaccard similarities best recapitulated a phylogenetic/taxonomic tree of life, with clear boundaries for superkingdom domains and high subtree similarity for named taxa at lower levels (family through phylum). By analyzing ~14.2M prokaryotic genome comparisons by their lowest-common-ancestor taxon levels, we detected many potential misclassification errors in a curated database, further demonstrating the need for wide-scale adoption of quantitative taxonomic classifications based on whole-genome similarity.
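The two quantities the abstract relies on, k-mer sequence space coverage and pairwise Jaccard similarity, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`kmer_set`, `jaccard`, `space_coverage`) are hypothetical, the toy sequences are invented, and the sketch assumes single-strand k-mers with no canonicalization of reverse complements, which a real genome comparison would likely need.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Return the distinct k-mers in seq, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    valid = set("ACGT")
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    }


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def space_coverage(kmers: set[str], k: int) -> float:
    """Fraction of the 4**k possible DNA k-mers that are present in the set."""
    return len(kmers) / 4 ** k


# Toy sequences (illustrative only; real inputs would be whole genomes).
genome_a = kmer_set("ACGTAC", 3)
genome_b = kmer_set("ACGTTT", 3)
similarity = jaccard(genome_a, genome_b)
coverage = space_coverage(genome_a, 3)
```

Short k-mers saturate the 4^k space and long ones share almost nothing between distant genomes, which is presumably why the abstract examines coverage before settling on a range of lengths. A matrix of such pairwise similarities, converted to distances as 1 minus similarity, is the kind of input a hierarchical clustering routine would take.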
Publisher
Public Library of Science (PLoS)
Cited by
28 articles.