μ-PBWT: Enabling the Storage and Use of UK Biobank Data on a Commodity Laptop

Published: 2023-02-16

Author:

Cozzi Davide, Rossi Massimiliano, Rubinacci Simone, Köppl Dominik, Boucher Christina, Bonizzoni Paola

Abstract

Motivation: The positional Burrows-Wheeler Transform (PBWT) has been introduced as a key data structure for indexing haplotype sequences, with the main purpose of finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time, a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow queries over modern biobank panels consisting of several million haplotypes, because they must be kept entirely in memory.

Results: In this paper, we present a method for constructing the run-length encoded PBWT for memory-efficient haplotype matching. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets from the 1000 Genomes Project and the UK Biobank. Our experiments demonstrate that μ-PBWT reduces memory usage by up to a factor of 25 compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in half the space of its BCF file. In addition, μ-PBWT is able to index a dataset with 2 million haplotypes and 2.3 million sites in 4 GB of space, which can be uploaded in 20 seconds on a commodity laptop. μ-PBWT is an adaptation of techniques for the run-length compressed BWT to the PBWT (RLPBWT), and it is based on keeping in memory only a small representation of the RLPBWT that still allows the efficient computation of set-maximal exact matches (SMEMs) over the original panel.

Availability: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html

Contact: Paola Bonizzoni, paola.bonizzoni@unimib.it
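To illustrate why run-length encoding suits the PBWT, the following minimal Python sketch builds the PBWT columns of a tiny biallelic panel and run-length encodes each one. It is an illustration under stated assumptions, not the μ-PBWT implementation: the function names `pbwt_columns` and `run_length_encode` and the toy panel are hypothetical, and the sketch omits the additional structures a real index needs to answer SMEM queries.

```python
from itertools import groupby


def pbwt_columns(panel):
    """Yield, for each site k, the alleles at site k listed in positional
    prefix order (haplotypes sorted by their reversed prefixes before k).

    Assumes a biallelic panel: a list of equal-length lists of 0/1 alleles.
    """
    order = list(range(len(panel)))  # initial order: haplotypes as given
    for k in range(len(panel[0])):
        yield [panel[i][k] for i in order]
        # Stable partition by the allele at site k: zeros first, then ones.
        zeros = [i for i in order if panel[i][k] == 0]
        ones = [i for i in order if panel[i][k] == 1]
        order = zeros + ones


def run_length_encode(column):
    """Collapse a column into (allele, run_length) pairs."""
    return [(allele, len(list(group))) for allele, group in groupby(column)]


# Hypothetical toy panel: 5 haplotypes, 4 variation sites.
panel = [
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]

encoded = [run_length_encode(col) for col in pbwt_columns(panel)]
total_runs = sum(len(runs) for runs in encoded)

for k, runs in enumerate(encoded):
    print(f"site {k}: {runs}")
print("total runs:", total_runs)
```

In this toy panel, the last site read in the original haplotype order is 1, 1, 0, 1, 0 (four runs), while in positional prefix order it becomes 1, 1, 1, 0, 0 (two runs). Haplotypes that share a recent history are placed next to each other, so columns tend to contain few, long runs, which is what makes a run-length representation compact.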

Publisher

Cold Spring Harbor Laboratory

References (27 articles)

1. Bjarni V Halldorsson, Hannes P Eggertsson, Kristjan HS Moore, Hannes Hauswedell, Ogmundur Eiriksson, Magnus O Ulfarsson, Gunnar Palsson, Marteinn T Hardarson, Asmundur Oddsson, Brynjar O Jensson, et al. The sequences of 150,119 genomes in the UK Biobank. Nature, pages 1–9, 2022.

2. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program.

3. Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT).

4. Jasmijn A Baaijens, Paola Bonizzoni, Christina Boucher, Gianluca Della Vedova, Yuri Pirola, Raffaella Rizzi, and Jouni Sirén. Computational graph pangenomics: a tutorial on data structures and their applications. Natural Computing, pages 1–28, 2022.

5. Bayesian inference of phylogenetic networks from bi-allelic genetic markers. PLoS Computational Biology, 2018.

Cited by 1 article.

1. Data Structures for SMEM-Finding in the PBWT. String Processing and Information Retrieval, 2023.