Abstract
The reference indexing problem for k-mers is to pre-process a collection of reference genomic sequences ℛ so that the position of all occurrences of any queried k-mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics.

In this work, we introduce the spectrum preserving tiling (SPT), a general representation of ℛ that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in ℛ. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for k-mers into: (1) a k-mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recently introduced work to construct and compactly index k-mer sets can be used to efficiently implement the k-mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirements of the tile-to-occurrence mapping dominate those of the k-mer-to-tile mapping, since the former depends on the total amount of sequence while the latter depends on the number of unique k-mers in ℛ.

To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-reference mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing over 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3GB to 34.6GB while incurring only a 3.6× slowdown when querying k-mers from a sequenced readset.

Supplementary materials: Sections S.1 to S.8 available online at https://doi.org/10.5281/zenodo.7504717

Availability: pufferfish2 is implemented in Rust and available at https://github.com/COMBINE-lab/pufferfish2.
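The two-step decomposition described in the abstract can be illustrated with a toy sketch. This is not the pufferfish2 implementation: the function names (`build_toy_index`, `locate`), the example tiles, and the use of plain Python dictionaries are all assumptions made for illustration, and the sketch ignores reverse complements and the sampling schemes entirely. It only shows how a query is answered by first mapping a k-mer to a tile and an offset within it, then mapping that tile to its occurrences in the references.

```python
def build_toy_index(tiles, tilings, k):
    """Build the two mappings of a toy index over a tiling.

    tiles:   dict tile_id -> tile sequence (hypothetical input format)
    tilings: dict reference name -> list of (tile_id, start position in reference)
    Assumes every k-mer occurs in exactly one tile, at one offset.
    """
    # Mapping (1): k-mer -> (tile, offset of the k-mer inside the tile)
    kmer_to_tile = {}
    for tile_id, seq in tiles.items():
        for offset in range(len(seq) - k + 1):
            kmer_to_tile[seq[offset:offset + k]] = (tile_id, offset)

    # Mapping (2): tile -> list of (reference, start position of the tile)
    tile_to_occ = {}
    for ref, tiling in tilings.items():
        for tile_id, pos in tiling:
            tile_to_occ.setdefault(tile_id, []).append((ref, pos))

    return kmer_to_tile, tile_to_occ


def locate(kmer, kmer_to_tile, tile_to_occ):
    """Return every (reference, position) where the k-mer occurs."""
    hit = kmer_to_tile.get(kmer)
    if hit is None:
        return []  # k-mer is absent from all references
    tile_id, offset = hit
    # Shift each tile occurrence by the k-mer's offset inside the tile.
    return [(ref, pos + offset) for ref, pos in tile_to_occ.get(tile_id, [])]


# Toy data with k = 3: consecutive tiles overlap by k - 1 = 2 characters.
K = 3
TILES = {"t0": "ACGT", "t1": "GTAA", "t2": "GTCC"}
REFS = {"r1": "ACGTAA", "r2": "ACGTCC"}
TILINGS = {"r1": [("t0", 0), ("t1", 2)], "r2": [("t0", 0), ("t2", 2)]}

KMER_TO_TILE, TILE_TO_OCC = build_toy_index(TILES, TILINGS, K)
print(locate("CGT", KMER_TO_TILE, TILE_TO_OCC))
```

In this sketch the first mapping grows with the number of distinct k-mers, while the second grows with the number of tile occurrences across all references, which mirrors the abstract's point about which mapping dominates space as the collection grows.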
Publisher
Cold Spring Harbor Laboratory
Cited by
3 articles.