Abstract
Motivation: The positional Burrows-Wheeler Transform (PBWT) has been introduced as a key data structure for indexing haplotype sequences, with the main purpose of finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time, a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow queries over modern biobank panels consisting of several million haplotypes, because the panels must be kept entirely in memory.

Results: In this paper, we present a method for constructing the run-length encoded PBWT for memory-efficient haplotype matching. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets from the 1000 Genomes Project and the UK Biobank. Our experiments demonstrate that μ-PBWT reduces memory usage by up to a factor of 25 compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in half the space of its BCF file. In addition, μ-PBWT is able to index a dataset with 2 million haplotypes and 2.3 million sites in 4 GB of space, which can be uploaded in 20 seconds on a commodity laptop. μ-PBWT is an adaptation of techniques for the run-length compressed BWT to the PBWT (RLPBWT), and it is based on keeping in memory only a small representation of the RLPBWT that still allows the efficient computation of set-maximal exact matches (SMEMs) over the original panel.

Availability: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html

Contact: Paola Bonizzoni, paola.bonizzoni@unimib.it
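To illustrate why run-length encoding suits the PBWT, the following is a minimal sketch of the basic PBWT column construction followed by run-length encoding of each column. It is not the μ-PBWT implementation: the function names (`pbwt_columns`, `run_length_encode`, `total_runs`) are hypothetical, the panel is assumed to be biallelic (alleles 0 and 1) and small enough to hold as a list of lists, and none of the additional structures needed for SMEM queries are included.

```python
from itertools import groupby


def pbwt_columns(panel):
    """Yield one PBWT column per site.

    Each column lists the alleles at that site in the order obtained by
    sorting haplotypes on their reversed prefixes (the positional prefix
    array). Assumes a biallelic panel given as a list of equal-length
    lists of 0/1 values.
    """
    order = list(range(len(panel)))  # prefix array before the first site
    for site in range(len(panel[0])):
        yield [panel[h][site] for h in order]
        # Stable partition: haplotypes carrying allele 0 first, then allele 1.
        zeros = [h for h in order if panel[h][site] == 0]
        ones = [h for h in order if panel[h][site] == 1]
        order = zeros + ones


def run_length_encode(column):
    """Return the column as a list of (allele, run_length) pairs."""
    return [(allele, len(list(group))) for allele, group in groupby(column)]


def total_runs(panel):
    """Count the runs over all PBWT columns of the panel."""
    return sum(len(run_length_encode(col)) for col in pbwt_columns(panel))


if __name__ == "__main__":
    # Toy panel: 4 haplotypes, 3 sites (illustrative data only).
    panel = [
        [0, 0, 1],
        [1, 0, 0],
        [0, 0, 1],
        [1, 0, 0],
    ]
    for col in pbwt_columns(panel):
        print(col, run_length_encode(col))
```

In this toy panel the last site reads 1, 0, 1, 0 in input order (four runs), but the PBWT ordering groups haplotypes with identical preceding alleles, so the column becomes 1, 1, 0, 0 (two runs). Panels with long shared haplotype segments tend to produce few runs per column, which is what a run-length encoded index exploits.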
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.
1. Data Structures for SMEM-Finding in the PBWT; String Processing and Information Retrieval; 2023