Abstract
The exponential growth of DNA sequencing data calls for novel space-efficient algorithms for their compression and search. State-of-the-art approaches often use k-merization for data tokenization, yet efficiently representing and querying k-mer sets remains a significant bioinformatics challenge. Our recent work introduced the concept of masked superstrings, which compactly represent k-mer sets without reliance on common structural assumptions. However, the applicability of masked superstrings for set operations and membership queries remained open. Here, we develop the f-masked superstring framework, which integrates demasking functions f, enabling efficient k-mer set operations through concatenation. Combined with a tailored version of the FM-index, this framework provides a versatile, compact data structure for k-mer sets. We demonstrate its effectiveness with the FMSI program, which, when evaluated on bacterial pan-genomes, improves space efficiency by a factor of 1.4 to 4.5 compared to leading single k-mer-set indexing methods such as SSHash and SBWT. Overall, our work highlights the potential of f-masked superstrings as a versatile elementary data type for k-mer sets.
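The idea of set operations through concatenation can be illustrated with a minimal Python sketch. It is not the FMSI implementation and uses no FM-index; it assumes a masked superstring is a pair (string, 0/1 mask of the same length), that a demasking function f decides membership of a k-mer from the mask bits at all of its occurrences, and that the last k-1 mask bits are 0. The helper names (`kmers_of`, `f_or`, `f_xor`, `concatenate`) are hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import defaultdict

def kmers_of(superstring, mask, k, f):
    """Return the k-mer set represented by (superstring, mask) under demasking function f."""
    occurrences = defaultdict(list)
    # Collect the mask bit at every position where each k-mer starts.
    for i in range(len(superstring) - k + 1):
        occurrences[superstring[i:i + k]].append(mask[i])
    # A k-mer is represented if f accepts the list of its mask bits.
    return {kmer for kmer, bits in occurrences.items() if f(bits)}

def f_or(bits):
    """Represented if at least one occurrence is masked on."""
    return any(bits)

def f_xor(bits):
    """Represented if an odd number of occurrences are masked on."""
    return sum(bits) % 2 == 1

def concatenate(a, b):
    """Concatenate two masked superstrings (strings and masks alike)."""
    (s1, m1), (s2, m2) = a, b
    return s1 + s2, m1 + m2

k = 3
set_a = ("ACGT", [1, 1, 0, 0])  # represents {ACG, CGT}
set_b = ("CGTA", [1, 1, 0, 0])  # represents {CGT, GTA}

s, m = concatenate(set_a, set_b)  # k-mers spanning the boundary get mask bit 0

union = kmers_of(s, m, k, f_or)
sym_diff = kmers_of(s, m, k, f_xor)
print(sorted(union))     # ['ACG', 'CGT', 'GTA']
print(sorted(sym_diff))  # ['ACG', 'GTA']
```

In this toy setting the same concatenated pair yields the union under `f_or` and the symmetric difference under `f_xor`; only the demasking function changes, not the stored data.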
Publisher
Cold Spring Harbor Laboratory