Space-efficient representation of genomic k-mer count tables

Published:2022-03-21 Issue:1 Volume:17 Page:
ISSN:1748-7188
Container-title:Algorithms for Molecular Biology
language:en
Short-container-title:Algorithms Mol Biol

Author:

Shibuya Yoshihiro,Belazzougui Djamal,Kucherov Gregory

Abstract

Motivation: k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools output k-mer count tables containing both k-mers and counts, which easily reach tens of GB. Furthermore, such tables generally do not support efficient random-access queries.

Results: In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF), which effectively overcomes this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations that break the empirical entropy lower bound for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. coli and C. elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. coli genomes. In the case of exact counts, our representation takes about half the space of the empirical entropy, for large enough k.
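To make the "empirical zero-order entropy of the counts" concrete, the following is a minimal Python sketch on a toy sequence. It is not the authors' implementation: the function names `kmer_counts` and `zero_order_entropy` and the example sequence are assumptions chosen for illustration, and the sketch only computes the entropy figure that a CSF's space is measured against. It does not build a CSF, a Bloom filter, or minimizer buckets.

```python
from collections import Counter
from math import log2


def kmer_counts(seq, k):
    """Count every overlapping k-mer of seq (hypothetical helper, toy scale only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def zero_order_entropy(values):
    """Empirical zero-order entropy, in bits per value, of a sequence of values."""
    freq = Counter(values)
    n = sum(freq.values())
    return -sum((c / n) * log2(c / n) for c in freq.values())


# Toy input (an assumption, not data from the paper).
table = kmer_counts("AAAAACGT", 3)

# Entropy of the count values themselves: a skewed count distribution
# (most k-mers sharing one count) gives a low number of bits per k-mer.
h0 = zero_order_entropy(table.values())
print(len(table), "distinct 3-mers,", round(h0, 4), "bits per count")
```

In this toy table three of the four distinct 3-mers occur once and one occurs three times, so the count distribution is already skewed and the entropy is below one bit per k-mer.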

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computational Theory and Mathematics,Molecular Biology,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/s13015-022-00212-0.pdf

Reference 35 articles.

1. Sims GE, Jun S-R, Wu GA, Kim S-H. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci USA. 2009;106(8):2677–82. https://doi.org/10.1073/pnas.0813249106.

2. Yi H, Jin L. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res. 2013;41(7):75. https://doi.org/10.1093/nar/gkt003.

3. Dencker T, Leimeister C-A, Gerth M, Bleidorn C, Snir S, Morgenstern B. Multi-SpaM: A Maximum-Likelihood Approach to Phylogeny Reconstruction Using Multiple Spaced-Word Matches and Quartet Trees. In: Blanchette, M., Ouangraoua, A. (eds.) Comparative Genomics. Lecture Notes in Computer Science, 2018;pp. 227–241. Springer, Cham. https://doi.org/10.1007/978-3-030-00834-5_13.

4. Fan H, Ives AR, Surget-Groba Y, Cannon CH. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics. 2015;16(1):522. https://doi.org/10.1186/s12864-015-1647-5.

5. Rahman A, Hallgrímsdóttir I, Eisen M, Pachter L. Association mapping from sequencing reads using k-mers. eLife. 2018;7:32920. https://doi.org/10.7554/eLife.32920.

Cited by 7 articles.

1. Space-efficient computation of k-mer dictionaries for large values of k;Algorithms for Molecular Biology;2024-04-05

2. PreSubLncR: Predicting Subcellular Localization of Long Non-Coding RNA Based on Multi-Scale Attention Convolutional Network and Bidirectional Long Short-Term Memory Network;Processes;2024-03-26

3. Creating and Using Minimizer Sketches in Computational Genomics;Journal of Computational Biology;2023-12-01

4. Ten quick tips for bioinformatics analyses using an Apache Spark distributed computing environment;PLOS Computational Biology;2023-07-20

5. On weighted k-mer dictionaries;Algorithms for Molecular Biology;2023-06-17