On weighted k-mer dictionaries

Published:2023-06-17 Issue:1 Volume:18 Page:
ISSN:1748-7188
Container-title:Algorithms for Molecular Biology
language:en
Short-container-title:Algorithms Mol Biol

Author:

Pibiri Giulio Ermanno

Abstract

We consider the problem of representing a set of $k$-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a $k$-mer is efficient. The representation is called a weighted dictionary of $k$-mers and finds application in numerous tasks in Bioinformatics that usually count $k$-mers as a pre-processing step. In fact, $k$-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185–194, 2022) to also store compactly the weights of the $k$-mers. From a technical perspective, we exploit the order of the $k$-mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only $k$-mer dictionary that is exact, weighted, associative, fast, and small.
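The idea of encoding runs of weights can be illustrated with a minimal Python sketch. It is not the SSHash implementation: it assumes, purely for illustration, that the dictionary gives each $k$-mer a position `i` in its internal order and that the weights are listed in that same order. All function names and the sample weights below are hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter
from itertools import accumulate, groupby


def run_length_encode(weights):
    """Collapse consecutive equal weights into (weight, run_length) pairs."""
    return [(w, sum(1 for _ in group)) for w, group in groupby(weights)]


def build_run_index(weights):
    """Return (values, ends): the weight of each run and the exclusive end
    position of each run, so that a position can be mapped to its run."""
    runs = run_length_encode(weights)
    values = [w for w, _ in runs]
    ends = list(accumulate(length for _, length in runs))
    return values, ends


def weight_at(values, ends, i):
    """Look up the weight stored at position i with a binary search over run ends."""
    if i < 0:
        raise IndexError("position must be non-negative")
    return values[bisect_right(ends, i)]


def empirical_entropy_bits(weights):
    """Total bits needed by an order-0 entropy coder for the weight sequence."""
    n = len(weights)
    return -sum(c * math.log2(c / n) for c in Counter(weights).values())


# Hypothetical weights, listed in the (assumed) order of the k-mers.
weights = [1] * 6 + [5] * 3 + [1] * 4 + [2] * 3
values, ends = build_run_index(weights)
print(run_length_encode(weights))
print(weight_at(values, ends, 7))
print(round(empirical_entropy_bits(weights), 2), "bits at order-0 entropy")
```

The cost of the run-based representation grows with the number of runs rather than with the number of $k$-mers, which is why an order that places equal weights next to each other can beat the empirical entropy, and why reducing the number of runs improves compression further.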

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computational Theory and Mathematics,Molecular Biology,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/s13015-023-00226-2.pdf

References: 46 articles.

1. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77.

2. Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, et al. ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res. 2017;27(5):768–77.

3. Khorsand P, Hormozdiari F. Nebula: ultra-efficient mapping-free structural variant genotyper. Nucl Acids Res. 2021;49(8):e47.

4. Standage DS, Brown CT, Hormozdiari F. Kevlar: a mapping-free framework for accurate discovery of de novo variants. iScience. 2019;18:28–36.

5. Baier U, Beller T, Ohlebusch E. Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics. 2016;32(4):497–504.

Cited by 1 article.

1. Meta-colored compacted de Bruijn graphs;2023-07-25