Representation of k-mer sets using spectrum-preserving string sets

Published: 2020-01-08

Authors:

Rahman Amatur, Medvedev Paul

Abstract

Given the popularity and elegance of k-mer based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which we show can store a set of k-mers using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which we show improves index size by 10-44% compared to other state-of-the-art low memory indices. Our tool is publicly available at: https://github.com/medvedevgroup/UST/.
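
To make the idea of a spectrum-preserving string set concrete, the following is a minimal Python sketch, not the authors' UST implementation. It assumes a forward-strand-only model (reverse complements are ignored, unlike in real k-mer tools), and the function names `spectrum` and `greedy_spss` are hypothetical. It greedily glues together k-mers that overlap by k-1 characters, so that every input k-mer appears exactly once across the output strings.

```python
def spectrum(strings, k):
    """Return the set of all k-mers occurring in the given strings."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}


def greedy_spss(kmers):
    """Greedily merge k-mers with (k-1)-overlaps into longer strings.

    Assumption: forward strand only; this is an illustrative greedy
    heuristic, not the UST algorithm from the paper.
    """
    k = len(next(iter(kmers)))
    unused = set(kmers)
    result = []
    for seed in sorted(kmers):  # sorted only to make the output deterministic
        if seed not in unused:
            continue
        unused.remove(seed)
        s = seed
        # Extend to the right while some unused k-mer continues the string.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                nxt = s[len(s) - k + 1:] + c
                if nxt in unused:
                    unused.remove(nxt)
                    s += c
                    extended = True
                    break
        # Extend to the left in the same way.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                prv = c + s[:k - 1]
                if prv in unused:
                    unused.remove(prv)
                    s = c + s
                    extended = True
                    break
        result.append(s)
    return result


kmers = {"ACG", "CGT", "GTA", "TTT"}
strings = greedy_spss(kmers)
# The output strings contain exactly the input k-mers, in fewer characters.
assert spectrum(strings, 3) == kmers
assert sum(len(s) for s in strings) < sum(len(x) for x in kmers)
```

Each output string corresponds to a path through k-mers with consecutive overlaps, which is why the abstract can phrase the smallest-SPSS problem as a path cover problem; this sketch makes no attempt at optimality.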

Publisher

Cold Spring Harbor Laboratory

References (46 articles)

1. R. Chikhi, J. Holub, and P. Medvedev, “Data structures to represent sets of k-long DNA sequences,” arXiv:1903.12312 [cs, q-bio], Mar. 2019.

2. R. S. Harris and P. Medvedev, “Improved Representation of Sequence Bloom Trees,” bioRxiv, 2018.

3. Succinct data structures for assembling large genomes

4. R. Chikhi, A. Limasset, S. Jackman, J. T. Simpson, and P. Medvedev, “On the representation of de Bruijn graphs,” in International conference on Research in computational molecular biology. Springer, 2014, pp. 35–55.

5. Compacting de Bruijn graphs from sequencing data quickly and in low memory

Cited by 5 articles.

1. Data Structures to Represent a Set of k-long DNA Sequences;ACM Computing Surveys;2021-04

2. Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation;2020-11-16

3. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets;2020-03-30

4. Simplitigs as an efficient and scalable representation of de Bruijn graphs;2020-01-12

5. Efficient exact associative structure for sequencing data;2019-02-11