Abstract
AbstractRecent growth in population scale sequencing initiatives involve both cohort scale and proportion of genome surveyed, with a transition from genotyping arrays to broader genome sequencing approaches. The resulting datasets can be challenging to analyze. Here we introduce DNARecords a novel sparse-compatible format for large scale genetic data. The structure enables integration of complex data types such as medical images and drug structures towards the development of machine learning methods to predict disease risk and drug response. We demonstrate its speed and memory advantages for various genetics analyses. These performance advantages will become more pronounced as it becomes feasible to analyze variants of lower population allele frequencies. Finally, we provide an open-source software plugin, built on top of Hail, to allow researchers to write and read such records as well as a set of examples for how to use them.
Publisher
Cold Spring Harbor Laboratory