Efficient storage and regression computation for population-scale genome sequencing studies

Published: 2024-04-15

Authors:

Rivas Manuel A., Chang Christopher

Abstract

In the era of big data in human genetics, large-scale biobanks aggregating genetic data from diverse populations have become important for advancing our understanding of human health and disease. However, the computational and storage demands of whole genome sequencing (WGS) studies pose significant challenges, especially for researchers from underfunded institutions or developing countries, creating a disparity in research capabilities. We introduce new approaches that enhance computational efficiency and reduce data storage requirements for WGS studies. By developing algorithms for compressed storage of genetic data, focusing particularly on optimizing the representation of rare variants, and designing regression methods tailored to the scale and complexity of WGS data, we significantly lower computational and storage costs. We integrate our approach into PLINK 2.0. The implementation demonstrates considerable reductions in storage space and computational time without compromising analytical accuracy, as evidenced by its application to the All of Us project data. We improve the runtime of an exome-wide association analysis of 19.4 million variants and a single phenotype from 695.35 minutes (approximately 11.5 hours) on a single machine to 1.57 minutes using 30 GB of memory and 50 threads (8.67 minutes using 4 threads). We also generalize the approach to multi-phenotype analysis. We anticipate that our approach will enable researchers across the globe to unlock the potential of population biobanks, accelerating the pace of discoveries that can improve our understanding of human health and disease.
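
To illustrate why the representation of rare variants matters for storage, the following is a minimal sketch that compares a dense 2-bit-per-genotype layout with a sparse list of non-reference samples. It is not PLINK 2.0's actual on-disk format: the function names, the fixed 4-byte sample index, and the one-byte genotype cost are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def dense_bytes(n_samples: int) -> int:
    """Bytes for one variant when every genotype takes 2 bits (assumed layout)."""
    return (2 * n_samples + 7) // 8


def to_sparse(genotypes: List[int]) -> List[Tuple[int, int]]:
    """Keep only (sample_index, alt_allele_count) pairs for non-reference genotypes."""
    return [(i, g) for i, g in enumerate(genotypes) if g != 0]


def sparse_bytes(entries: List[Tuple[int, int]], index_bytes: int = 4) -> int:
    """Assumed cost: a fixed-width sample index plus one byte per stored genotype."""
    return len(entries) * (index_bytes + 1)


def from_sparse(entries: List[Tuple[int, int]], n_samples: int) -> List[int]:
    """Rebuild the dense genotype vector; samples not listed are homozygous reference."""
    dense = [0] * n_samples
    for i, g in entries:
        dense[i] = g
    return dense


# Hypothetical rare variant: 3 carriers among 100,000 samples.
n = 100_000
genotypes = [0] * n
genotypes[17] = 1
genotypes[4242] = 1
genotypes[99_999] = 2

sparse = to_sparse(genotypes)
assert from_sparse(sparse, n) == genotypes  # the sparse form is lossless

print(dense_bytes(n), sparse_bytes(sparse))
```

Under these assumptions, the dense cost depends only on the number of samples, while the sparse cost grows with the number of carriers, so the sparse form is only worthwhile when carriers are few.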

Publisher

Cold Spring Harbor Laboratory

References (21 articles)

1. Genetics of 35 blood and urine biomarkers in the UK Biobank; Nat. Genet., 2021

2. Rare protein-altering variants in ANGPTL7 lower intraocular pressure and protect against glaucoma

3. Akbari, P. et al. Sequencing of 640,000 exomes identifies variants associated with protection from obesity. Science 373 (2021).

4. Genomic data in the All of Us Research Program; All of Us Research Program Genomics Investigators; Nature, 2024

5. Prospective study design and data analysis in UK Biobank; Sci. Transl. Med., 2024

Cited by 1 article.

1. Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data; 2024-04-28