On Weighted K-Mer Dictionaries

Published: 2022-05-24

Author:

Pibiri Giulio Ermanno

Abstract

We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer are efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri, Bioinformatics 2022) to also store the weights of the k-mers compactly. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing (several times) better compression than the empirical entropy of the weights. We also study the problem of reducing the number of runs in the weights to improve compression even further and illustrate a lower bound for this problem. We propose an efficient greedy algorithm to reduce the number of runs and show empirically that it performs well, i.e., very close to the lower bound. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. To date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.
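To make the idea of encoding runs of weights concrete, the following is a minimal Python sketch. It is not the SSHash implementation: the function names (`run_length_encode`, `empirical_entropy_bits`, `weight_at`) and the toy weight sequence are assumptions made purely for illustration. It shows how a sequence of weights, listed in the order in which the k-mers are stored, collapses into a few (weight, run length) pairs, and how the zeroth-order empirical entropy of the same sequence can be computed for comparison.

```python
import math
from collections import Counter
from itertools import groupby


def run_length_encode(weights):
    """Collapse consecutive equal weights into (weight, run_length) pairs."""
    return [(w, sum(1 for _ in group)) for w, group in groupby(weights)]


def empirical_entropy_bits(weights):
    """Zeroth-order empirical entropy, in bits per weight."""
    n = len(weights)
    counts = Counter(weights)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def weight_at(runs, i):
    """Return the weight of the i-th k-mer by scanning the runs.

    Linear scan, for illustration only; a practical structure would
    use a faster search over the run boundaries.
    """
    for w, length in runs:
        if i < length:
            return w
        i -= length
    raise IndexError("k-mer position out of range")


# Hypothetical weights, in the order the k-mers are stored.
weights = [1] * 6 + [2] * 3 + [1] * 4 + [5] * 3
runs = run_length_encode(weights)
print(runs)  # [(1, 6), (2, 3), (1, 4), (5, 3)]
print(empirical_entropy_bits(weights) * len(weights), "bits at the entropy bound")
```

In this toy input, 16 weights reduce to 4 runs. The fewer the runs, the less there is to store, which is why reordering to reduce the number of runs can improve compression further.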

Publisher

Cold Spring Harbor Laboratory

Reference 38 articles.

1. A space and time-efficient index for the compacted colored de Bruijn graph; Bioinformatics, 2018

2. Graphical pan-genome analysis with compressed suffix trees and the Burrows–Wheeler transform

3. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing

4. Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In International Workshop on Algorithms in Bioinformatics (WABI), pages 225–235. Springer, 2012.

5. Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. In Digital SRC Research Report. Citeseer, 1994.

Cited by 2 articles.

1. Fulgor: a fast and compact k-mer index for large-scale matching and color queries; Algorithms for Molecular Biology; 2024-01-22

2. Spectrum Preserving Tilings Enable Sparse and Modular Reference Indexing; Lecture Notes in Computer Science; 2023