DNARecords: An extensible sparse format for petabyte scale genomics analysis

Published: 2022-08-15

Author:

Manas Andres, Seninge Lucas, Dixit Atray

Abstract

Recent growth in population-scale sequencing initiatives involves both cohort scale and the proportion of the genome surveyed, with a transition from genotyping arrays to broader genome sequencing approaches. The resulting datasets can be challenging to analyze. Here we introduce DNARecords, a novel sparse-compatible format for large-scale genetic data. The structure enables integration of complex data types, such as medical images and drug structures, toward the development of machine learning methods to predict disease risk and drug response. We demonstrate its speed and memory advantages for various genetics analyses. These performance advantages will become more pronounced as it becomes feasible to analyze variants of lower population allele frequencies. Finally, we provide an open-source software plugin, built on top of Hail, that allows researchers to write and read such records, as well as a set of examples showing how to use them.

Publisher

Cold Spring Harbor Laboratory
