Fulgor: A fast and compact k-mer index for large-scale matching and color queries

Published: 2023-05-11

Author:

Fan Jason, Singh Noor Pratap, Khan Jamshed, Pibiri Giulio Ermanno, Patro Rob

Abstract

The problem of sequence identification or matching (determining the subset of references from a given collection that are likely to contain a query nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. The reference collection should therefore be pre-processed into an index for fast queries. This poses the threefold challenge of designing an index that is efficient to query, has light memory usage, and scales well to large collections.

To solve this problem, we describe how recent advancements in associative, order-preserving k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1 + o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space.

We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2–6× faster to construct.

2012 ACM Subject Classification: Applied computing → Bioinformatics
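The mapping from k-mers to inverted lists described in the abstract can be illustrated with a small sketch. This is not Fulgor's implementation: it assumes a toy layout in which unitigs are sorted so that equal colors are adjacent, a bit per unitig marks where a new color starts, and a rank query over those bits gives the index of the inverted list. The plain Python `dict` stands in for the order-preserving k-mer dictionary, the prefix-sum list stands in for a succinct rank structure, reverse complements are ignored, and the names `build_index`, `kmer_color`, and `query_color` are hypothetical.

```python
from itertools import accumulate


def build_index(unitigs, k):
    """Build a toy colored index from (sequence, set_of_reference_ids) pairs."""
    # Sort so that unitigs sharing the same color set become adjacent.
    ordered = sorted(unitigs, key=lambda u: sorted(u[1]))
    colors = []          # one inverted list (color set) per distinct color
    boundary = []        # boundary[i] == 1 iff unitig i starts a new color
    kmer_to_unitig = {}  # stand-in for an order-preserving k-mer dictionary
    for uid, (seq, color) in enumerate(ordered):
        color = frozenset(color)
        is_new = not colors or colors[-1] != color
        boundary.append(1 if is_new else 0)
        if is_new:
            colors.append(color)
        for i in range(len(seq) - k + 1):
            kmer_to_unitig[seq[i:i + k]] = uid
    # Prefix sums play the role of rank over the one-bit-per-unitig vector.
    rank = list(accumulate(boundary))
    return {"k": k, "colors": colors, "rank": rank,
            "kmer_to_unitig": kmer_to_unitig}


def kmer_color(index, kmer):
    """Return the color of a k-mer, or None if the k-mer is not indexed."""
    uid = index["kmer_to_unitig"].get(kmer)
    if uid is None:
        return None
    # rank[uid] counts color starts up to and including this unitig.
    return index["colors"][index["rank"][uid] - 1]


def query_color(index, seq):
    """Intersect the colors of all indexed k-mers found in the query."""
    k = index["k"]
    result = None
    for i in range(len(seq) - k + 1):
        color = kmer_color(index, seq[i:i + k])
        if color is None:
            continue  # skip k-mers absent from the index
        result = color if result is None else result & color
    return result if result is not None else frozenset()


# Toy input: three made-up "unitigs" with their reference sets.
toy_unitigs = [("ACGT", {0, 1}), ("TTGA", {1}), ("GGCA", {0, 1})]
toy_index = build_index(toy_unitigs, k=3)
```

In this toy example two of the three unitigs share a color, so only two inverted lists are stored, and each unitig costs one boundary bit on top of them. Intersecting colors across the k-mers of a query is one simple matching policy chosen for illustration; other policies are possible.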

Publisher

Cold Spring Harbor Laboratory

References: 44 articles.

1. Jarno N Alanko, Simon J Puglisi, and Jaakko Vuohtoniemi. Succinct k-mer sets using subset rank queries on the spectral Burrows-Wheeler transform. bioRxiv, pages 2022–05, 2022.

2. Jarno N Alanko, Jaakko Vuohtoniemi, Tommi Mäklin, and Simon J Puglisi. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. bioRxiv, pages 2023–02, 2023.

3. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search

4. Individuals with autism spectrum disorder have altered visual encoding capacity

Cited by 5 articles.

1. kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species;Computational and Structural Biotechnology Journal;2024-12

2. Cdbgtricks: Strategies to update a compacted de Bruijn graph;2024-05-28

3. Meta-colored Compacted de Bruijn Graphs;Lecture Notes in Computer Science;2024

4. Movi: a fast and cache-efficient full-text pangenome index;2023-11-05

5. Meta-colored compacted de Bruijn graphs;2023-07-25