
Sparse and skew hashing of K-mers

Published:2022-06-24 Issue:Supplement_1 Volume:38 Page:i185-i194
ISSN:1367-4803
Container-title:Bioinformatics
language:en

Author:

Pibiri Giulio Ermanno¹

Affiliation:

1. ISTI-CNR, Pisa 56124, Italy

Abstract

Motivation: A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the heart of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, on the order of billions of strings; in such cases, the memory consumption and query efficiency of the data structure are a concrete challenge.

Results: To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is, a data structure where strings are represented in compact form and each of them is associated with a unique integer identifier in the range [0,n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions.

Availability and implementation: https://github.com/jermp/sshash

Supplementary information: Supplementary data are available at Bioinformatics online.
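To make the terms in the abstract concrete, here is a minimal Python sketch of an associative k-mer dictionary and of k-mer minimizers. It is only an illustration of the interface, not the method of the paper: it uses an ordinary Python dict rather than compressed storage or minimal perfect hashing, and it assumes that the minimizer of a k-mer is its lexicographically smallest substring of length m. All function names (`kmers`, `minimizer`, `build_dictionary`, `lookup`) are hypothetical.

```python
def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def minimizer(kmer, m):
    """Return the smallest length-m substring of kmer.

    Assumption: plain lexicographic order is used to rank m-mers;
    other orders (for example a hash-based one) are equally possible.
    """
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def build_dictionary(seqs, k):
    """Toy associative dictionary: map each distinct k-mer to an id in [0, n)."""
    ids = {}
    for seq in seqs:
        for km in kmers(seq, k):
            if km not in ids:
                ids[km] = len(ids)  # next unused identifier
    return ids


def lookup(ids, kmer):
    """Membership query: return the k-mer's id, or -1 if it is absent."""
    return ids.get(kmer, -1)
```

For instance, `build_dictionary(["ACGTACGT"], 4)` stores the four distinct 4-mers of the string and assigns them the identifiers 0 to 3, while `lookup` returns -1 for any k-mer outside the set. Consecutive k-mers of a sequence often share the same minimizer, which is the kind of statistical regularity the abstract refers to.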

Funder

MobiDataLab

OK-INSAID

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Link

https://academic.oup.com/bioinformatics/article-pdf/38/Supplement_1/i185/49887045/btac245.pdf

Reference 36 articles.

1. A space and time-efficient index for the compacted colored de Bruijn graph;Almodaresi;Bioinformatics,2018

2. Simplitigs as an efficient and scalable representation of de Bruijn graphs;Břinda;Genome Biol,2021

Cited by 33 articles.

1. kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species;Computational and Structural Biotechnology Journal;2024-12

2. A survey of k-mer methods and applications in bioinformatics;Computational and Structural Biotechnology Journal;2024-12

3. Where the patterns are: repetition-aware compression for colored de Bruijn graphs;2024-07-13

4. Conway–Bromage–Lyndon (CBL): an exact, dynamic representation of k-mer sets;Bioinformatics;2024-06-28

5. The mod-minimizer: a simple and efficient sampling algorithm for long k-mers;2024-05-31