Where the patterns are: repetition-aware compression for colored de Bruijn graphs

Published: 2024-07-13

Author:

Alessio Campanelli, Giulio Ermanno Pibiri, Jason Fan, Rob Patro

Abstract

We describe lossless compressed data structures for the colored de Bruijn graph (or, c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer. While these maps find countless applications in computational biology (e.g., basic query, read mapping, abundance estimation), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once, instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes.

Software: The implementation of the indexes used for all experiments in this work is written in C++17 and is available at https://github.com/jermp/fulgor.
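To make the definition in the abstract concrete, here is a minimal Python sketch of a c-dBG viewed as a map from k-mers to color sets, followed by the simplest way to exploit repetition: storing each distinct color set only once. It is an illustration under stated assumptions, not the method of the paper or of the fulgor software. The function names (`build_color_map`, `deduplicate_color_sets`) and the toy sequences are hypothetical, k-mers are not canonicalized with their reverse complements, and whole-set deduplication is a much coarser idea than the pattern factorization the abstract describes.

```python
from collections import defaultdict


def build_color_map(references, k):
    """Map every k-mer to the sorted tuple of reference ids (colors) containing it.

    Assumption: k-mers are taken as-is from the forward strand only.
    """
    colors = defaultdict(set)
    for color, seq in enumerate(references):
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(color)
    return {kmer: tuple(sorted(ids)) for kmer, ids in colors.items()}


def deduplicate_color_sets(color_map):
    """Store each distinct color set once; every k-mer keeps only an integer id."""
    distinct = []    # distinct color sets, in first-seen order
    set_id = {}      # color set -> its position in `distinct`
    kmer_to_id = {}  # k-mer -> position of its color set in `distinct`
    for kmer, color_set in color_map.items():
        if color_set not in set_id:
            set_id[color_set] = len(distinct)
            distinct.append(color_set)
        kmer_to_id[kmer] = set_id[color_set]
    return kmer_to_id, distinct


# Toy collection of three hypothetical references (colors 0, 1, 2).
references = ["ACGTAC", "ACGTTC", "GGACGT"]
color_map = build_color_map(references, k=3)
kmer_to_id, distinct = deduplicate_color_sets(color_map)
```

In this toy input many k-mers share a color set, so the number of distinct sets is smaller than the number of k-mers. The representations described in the abstract go further by factoring out patterns that repeat across different color sets, which this sketch does not attempt.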

Publisher

Cold Spring Harbor Laboratory
