Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for k-Mer Sets

Published: 2024-03-11

Author:

Sladký Ondřej, Veselý Pavel, Břinda Karel

Abstract

The exponential growth of DNA sequencing data calls for novel space-efficient algorithms for their compression and search. State-of-the-art approaches often use k-merization for data tokenization, yet efficiently representing and querying k-mer sets remains a significant bioinformatics challenge. Our recent work introduced the concept of masked superstrings, which compactly represent k-mer sets without reliance on common structural assumptions. However, the applicability of masked superstrings for set operations and membership queries remained open. Here, we develop the f-masked superstring framework, which integrates demasking functions f, enabling efficient k-mer set operations through concatenation. Combined with a tailored version of the FM-index, this framework provides a versatile, compact data structure for k-mer sets. We demonstrate its effectiveness with the FMSI program, which, when evaluated on bacterial pan-genomes, improves space efficiency by a factor of 1.4 to 4.5 compared to leading single k-mer-set indexing methods such as SSHash and SBWT. Overall, our work highlights the potential of f-masked superstrings as a versatile elementary data type for k-mer sets.
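The abstract does not spell out how demasking functions turn concatenation into a set operation, so the following is only a minimal sketch under stated assumptions, not the FMSI implementation. It assumes a masked superstring is a string paired with a 0/1 mask of the same length, that a k-mer's membership is decided by applying f to the mask bits at all of its occurrences, and that the last k-1 mask bits are 0. The helper names (`occurrence_bits`, `demask`, `concat`, `f_or`, `f_xor`) are hypothetical, and reverse complements are ignored for simplicity.

```python
from typing import Callable, Dict, List, Set, Tuple

MaskedSuperstring = Tuple[str, str]  # (superstring, mask of '0'/'1' characters)


def occurrence_bits(ms: MaskedSuperstring, k: int) -> Dict[str, List[int]]:
    """Map each k-mer occurring in the superstring to the mask bits at its start positions."""
    superstring, mask = ms
    assert len(superstring) == len(mask)
    bits: Dict[str, List[int]] = {}
    for i in range(len(superstring) - k + 1):
        bits.setdefault(superstring[i:i + k], []).append(int(mask[i]))
    return bits


def demask(ms: MaskedSuperstring, k: int, f: Callable[[List[int]], bool]) -> Set[str]:
    """Return the k-mers whose occurrence bits are accepted by the demasking function f."""
    return {kmer for kmer, b in occurrence_bits(ms, k).items() if f(b)}


def f_or(bits: List[int]) -> bool:
    """Assumed demasking function: a k-mer is present if any occurrence is masked on."""
    return any(bits)


def f_xor(bits: List[int]) -> bool:
    """Assumed demasking function: a k-mer is present if an odd number of occurrences are on."""
    return sum(bits) % 2 == 1


def concat(a: MaskedSuperstring, b: MaskedSuperstring) -> MaskedSuperstring:
    """Concatenate superstrings and masks; boundary k-mers start at mask-0 positions."""
    return a[0] + b[0], a[1] + b[1]


k = 3
set_a = ("ACGT", "1100")  # represents {ACG, CGT}
set_b = ("CGTA", "1100")  # represents {CGT, GTA}
joined = concat(set_a, set_b)

union = demask(joined, k, f_or)
sym_diff = demask(joined, k, f_xor)
print(sorted(union), sorted(sym_diff))
```

In this toy case, concatenating the two inputs and demasking with the or-style function gives the union {ACG, CGT, GTA}, while the xor-style function gives the symmetric difference {ACG, GTA}, because CGT is masked on twice. The k-mers GTC and TCG that appear across the boundary are never reported, since their start positions carry mask bit 0.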

Publisher

Cold Spring Harbor Laboratory

References: 64 articles.

1. Small Searchable κ-Spectra via Subset Rank Queries on the Spectral Burrows-Wheeler Transform

2. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes

3. An incrementally updatable and scalable system for large-scale sequence search using the Bentley–Saxe transformation

4. A space and time-efficient index for the compacted colored de Bruijn graph

5. COBS: A Compact Bit-Sliced Signature Index