Abstract
In this paper, we introduce the Conway-Bromage-Lyndon (CBL) structure, a compressed, dynamic and exact method for representing k-mer sets. Originating from Conway and Bromage's concept, CBL employs the smallest cyclic rotations of k-mers, akin to Lyndon words, to leverage lexicographic redundancies. To support dynamic operations and set operations, we propose a dynamic bit vector structure that draws a parallel with Elias-Fano's scheme. This structure is encapsulated in a Rust library, demonstrating a balanced blend of construction efficiency, cache locality, and compression. Our findings suggest that CBL outperforms existing dynamic k-mer set methods. Unique to this work, CBL stands out as the only known exact k-mer structure offering in-place set operations. Its combined abilities position it as a flexible Swiss Army knife for k-mer set management. Availability: https://github.com/imartayan/CBL
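To make the idea of a smallest cyclic rotation concrete, here is a minimal sketch in Rust. It is not taken from the CBL library: the function name `smallest_rotation` is hypothetical, it works on plain ASCII strings rather than on whatever packed encoding the library uses, and it favors clarity over speed with a simple quadratic comparison.

```rust
/// Illustrative helper (hypothetical name, not part of the CBL API).
/// Returns the lexicographically smallest cyclic rotation of `kmer`
/// and the offset in `kmer` at which that rotation starts.
fn smallest_rotation(kmer: &str) -> (String, usize) {
    let bytes = kmer.as_bytes();
    let k = bytes.len();
    let mut best = 0;

    // Compare every candidate rotation against the best one found so far.
    for start in 1..k {
        for i in 0..k {
            let a = bytes[(start + i) % k];
            let b = bytes[(best + i) % k];
            if a != b {
                if a < b {
                    best = start;
                }
                break;
            }
        }
    }

    // Materialize the rotation: suffix starting at `best`, then the prefix.
    let rotated: Vec<u8> = bytes[best..]
        .iter()
        .chain(bytes[..best].iter())
        .copied()
        .collect();
    (
        String::from_utf8(rotated).expect("k-mer is assumed to be ASCII"),
        best,
    )
}
```

Under these assumptions, `smallest_rotation("GATTACA")` returns `("ACAGATT", 4)`, and every other rotation of the same k-mer, such as `TTACAGA`, maps to the same string `ACAGATT`. The returned offset is what allows the original k-mer to be recovered from its canonical rotation.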
Publisher
Cold Spring Harbor Laboratory