Abstract
The Variant Call Format (VCF) is widely used in genome sequencing but scales poorly. For instance, we estimate a 150,000-genome VCF would occupy 900 TiB, making it both costly and complicated to produce and analyze. The issue stems from VCF's requirement to densely represent both reference genotypes and allele-indexed arrays. These requirements lead to unnecessary data duplication and, ultimately, very large files.

To address these challenges, we introduce the Scalable Variant Call Representation (SVCR). This representation reduces file sizes by ensuring they scale linearly with samples. SVCR achieves this by adopting reference blocks from the Genomic Variant Call Format (GVCF) and employing local allele indices. SVCR is also lossless and mergeable, allowing for N+1 and N+K incremental joint-calling.

We present two implementations of SVCR: SVCR-VCF, which encodes SVCR in VCF format, and VDS, which uses Hail's native format. Our experiments confirm the linear scalability of SVCR-VCF and VDS, in contrast to the super-linear growth seen with standard VCF files. We also discuss the VDS Combiner, a scalable, open-source tool for producing a VDS from GVCFs, and unique features of VDS that enable rapid data analysis. SVCR, and VDS in particular, ensure the scientific community can generate, analyze, and disseminate genetics datasets with millions of samples.
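
To make the effect of local allele indices concrete, the following is a minimal sketch, not the SVCR-VCF or VDS implementation. It assumes diploid genotypes, for which an allele-indexed likelihood array over n alleles has n(n+1)/2 entries, and it compares the number of entries stored when every sample is indexed against all alleles at a site with the number stored when each sample is indexed only against the alleles it carries. All function names and the example numbers are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple


def genotype_array_length(n_alleles: int) -> int:
    """Entries in a diploid, allele-indexed genotype array for n_alleles alleles."""
    return n_alleles * (n_alleles + 1) // 2


def dense_entries(n_samples: int, n_site_alleles: int) -> int:
    """Entries stored when every sample is indexed against all site alleles."""
    return n_samples * genotype_array_length(n_site_alleles)


def local_entries(local_allele_counts: List[int]) -> int:
    """Entries stored when each sample is indexed only against its own alleles."""
    return sum(genotype_array_length(k) for k in local_allele_counts)


def to_global_genotype(local_alleles: List[int], local_gt: Tuple[int, int]) -> Tuple[int, int]:
    """Map a genotype over local allele indices back to site-wide allele indices."""
    return (local_alleles[local_gt[0]], local_alleles[local_gt[1]])


# Hypothetical site: reference plus 9 alternate alleles, 1,000 samples,
# each sample carrying the reference and one alternate allele.
dense = dense_entries(n_samples=1000, n_site_alleles=10)
local = local_entries([2] * 1000)
print(dense, local)

# A sample whose local alleles are site alleles 0 and 3, with local genotype 0/1.
print(to_global_genotype([0, 3], (0, 1)))
```

In this sketch the dense count grows with the square of the number of alleles at the site, which itself tends to grow as samples are added, whereas the local count depends only on the alleles each sample carries. The mapping function shows why the local form loses no information: the site-wide genotype can always be recovered from the local allele list.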
Publisher
Cold Spring Harbor Laboratory