Spectrum Preserving Tilings Enable Sparse and Modular Reference Indexing

Author:

Fan Jason, Khan Jamshed, Pibiri Giulio Ermanno, Patro Rob

Abstract

The reference indexing problem for $$k$$-mers is to pre-process a collection of reference genomic sequences $$\mathcal{R}$$ so that the positions of all occurrences of any queried $$k$$-mer can be rapidly identified. An efficient and scalable solution to this problem is fundamental for many tasks in bioinformatics.

In this work, we introduce the spectrum preserving tiling (SPT), a general representation of $$\mathcal{R}$$ that specifies how a set of tiles repeatedly occur to spell out the constituent reference sequences in $$\mathcal{R}$$. By encoding the order and positions where tiles occur, SPTs enable the implementation and analysis of a general class of modular indexes. An index over an SPT decomposes the reference indexing problem for $$k$$-mers into: (1) a $$k$$-mer-to-tile mapping; and (2) a tile-to-occurrence mapping. Recently introduced work to construct and compactly index $$k$$-mer sets can be used to efficiently implement the $$k$$-mer-to-tile mapping. However, implementing the tile-to-occurrence mapping remains prohibitively costly in terms of space. As reference collections become large, the space requirement of the tile-to-occurrence mapping dominates that of the $$k$$-mer-to-tile mapping, since the former depends on the total amount of sequence while the latter depends on the number of unique $$k$$-mers in $$\mathcal{R}$$.

To address this, we introduce a class of sampling schemes for SPTs that trade off speed to reduce the size of the tile-to-occurrence mapping. We implement a practical index with these sampling schemes in the tool pufferfish2. When indexing over 30,000 bacterial genomes, pufferfish2 reduces the size of the tile-to-occurrence mapping from 86.3 GB to 34.6 GB while incurring only a 3.6$$\times$$ slowdown when querying $$k$$-mers from a sequenced readset.

Availability: pufferfish2 is implemented in Rust and available at https://github.com/COMBINE-lab/pufferfish2.
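The two-step decomposition described in the abstract can be illustrated with a toy example. The sketch below is not the tool's implementation: it uses plain Python dictionaries where a real index would use compact data structures, it ignores tile orientation and reverse complements, and it does no sampling. The names `build_toy_spt_index` and `locate`, and the tiles and references themselves, are hypothetical and chosen only for illustration. It assumes each $$k$$-mer appears in exactly one tile at one offset, and that each reference is given as a list of (tile, start position) pairs.

```python
from collections import defaultdict

K = 3

# Hypothetical tiles; every 3-mer appears in exactly one tile.
TILES = {"t0": "ACGT", "t1": "GTTA", "t2": "TAC"}

# Hypothetical references, kept only to check the answers.
REFS = {"r1": "ACGTTA", "r2": "GTTACGT"}

# Tiling of each reference: (tile id, start position in the reference).
# Consecutive tiles overlap by K - 1 characters.
TILINGS = {
    "r1": [("t0", 0), ("t1", 2)],
    "r2": [("t1", 0), ("t2", 2), ("t0", 3)],
}


def build_toy_spt_index(tiles, tilings, k):
    """Build the two mappings: k-mer -> (tile, offset) and tile -> occurrences."""
    kmer_to_tile = {}
    for tile_id, seq in tiles.items():
        for offset in range(len(seq) - k + 1):
            kmer = seq[offset:offset + k]
            if kmer in kmer_to_tile:
                raise ValueError(f"k-mer {kmer} occurs in more than one place")
            kmer_to_tile[kmer] = (tile_id, offset)

    tile_to_occ = defaultdict(list)
    for ref_name, occurrences in tilings.items():
        for tile_id, start in occurrences:
            tile_to_occ[tile_id].append((ref_name, start))
    return kmer_to_tile, dict(tile_to_occ)


def locate(kmer, kmer_to_tile, tile_to_occ):
    """Return every (reference, position) at which the k-mer occurs."""
    hit = kmer_to_tile.get(kmer)
    if hit is None:
        return []  # k-mer is absent from the reference collection
    tile_id, offset = hit
    # Step 2: shift each tile occurrence by the k-mer's offset in the tile.
    return [(ref, start + offset) for ref, start in tile_to_occ.get(tile_id, [])]


KMER_TO_TILE, TILE_TO_OCC = build_toy_spt_index(TILES, TILINGS, K)
```

In this sketch the first mapping grows with the number of distinct $$k$$-mers, while the second grows with the number of tile occurrences across all references, which mirrors the space argument made in the abstract.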

Publisher

Springer Nature Switzerland
