Abstract
While aligning sequences has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods are appealing because they simplify the process of inference, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for some emerging forms of data, such as genome skims, which cannot be assembled. Despite this appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is that they typically rely on simplified models of sequence evolution such as Jukes-Cantor. It is possible to compute pairwise distances under more complex models by computing frequencies of base substitutions, provided that these quantities can be estimated in the alignment-free setting. A particular limitation is that for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the strand of DNA sequences is unknown. Under such conditions, the so-called no-strand bias models are the most complex models that can be used. Here, we show how to calculate distances under a no-strand bias restriction of the General Time Reversible (GTR) model called TK4 without relying on alignments. The method relies on replacing letters in the input sequences and then computing Jaccard indices between k-mer sets. For the method to work on large genomes, we also need to compute the number of k-mer mismatches after replacement that are due to random chance as opposed to homology. We show in simulation that these alignment-free distances can be highly accurate when genomes evolve under the assumed models, and we examine the effectiveness of the method on real genomic data.
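As a rough illustration of the two ingredients named above (letter replacement followed by Jaccard indices between k-mer sets), the following Python sketch may help. It is not the TK4 estimator described in this work. The function names are hypothetical, the letter mapping in the usage lines (merging G into A and T into C) is an arbitrary choice made only for illustration, and the final conversion is the Mash-style formula, which assumes a simple Poisson mutation model rather than TK4. The sketch also ignores unknown strand orientation and chance k-mer matches, both of which the abstract identifies as things the actual method has to handle.

```python
import math


def kmer_set(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def replace_letters(seq, mapping):
    """Replace single letters in seq according to mapping, e.g. {"G": "A"}."""
    return seq.translate(str.maketrans(mapping))


def mash_style_distance(j, k):
    """Mash-style distance from a Jaccard index j and k-mer length k.

    Assumption: a simple Poisson mutation model, not the TK4 model.
    """
    if j <= 0.0:
        return math.inf
    return -math.log(2.0 * j / (1.0 + j)) / k


# Illustrative usage with a hypothetical two-letter reduction of the alphabet.
mapping = {"G": "A", "T": "C"}
x = replace_letters("ACGTACGTTGCA", mapping)
y = replace_letters("ACGTACCTTGCA", mapping)
j = jaccard(kmer_set(x, 4), kmer_set(y, 4))
d = mash_style_distance(j, 4)
```

Comparing Jaccard indices computed under several different letter replacements is, loosely, what lets individual substitution types be separated; how those indices are combined into a TK4 distance is specific to the paper and is not reproduced here.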
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.