Abstract
AbstractGiven the popularity and elegance of k-mer based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which we show can store a set of k-mers using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which we show improves index size by 10-44% compared to other state-of-the-art low memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.
Publisher
Cold Spring Harbor Laboratory
Reference46 articles.
1. R. Chikhi , J. Holub , and P. Medvedev , “Data structures to represent sets of k-long DNA sequences,” arXiv:1903.12312 [cs, q-bio], Mar. 2019.
2. R. S. Harris and P. Medvedev , “Improved Representation of Sequence Bloom Trees,” bioRxiv, 2018.
3. Succinct data structures for assembling large genomes
4. R. Chikhi , A. Limasset , S. Jackman , J. T. Simpson , and P. Medvedev , “On the representation of de Bruijn graphs,” in International conference on Research in computational molecular biology. Springer, 2014, pp. 35–55.
5. Compacting de Bruijn graphs from sequencing data quickly and in low memory
Cited by
5 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献