CHARR efficiently estimates contamination from DNA sequencing data
Author:
Lu Wenhan, Gauthier Laura D., Poterba Timothy, Giacopuzzi Edoardo, Goodrich Julia K., Stevens Christine R., King Daniel, Daly Mark J., Neale Benjamin M., Karczewski Konrad J.
Abstract
DNA sample contamination is a major issue in clinical and research applications of whole genome and exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Popular tools for estimating the contamination level currently use short-read data (BAM/CRAM files), which are expensive to store and manipulate and are often not retained or shared widely. We propose a new metric to estimate DNA sample contamination from variant-level whole genome and exome sequence data: CHARR (Contamination from Homozygous Alternate Reference Reads), which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses only a small proportion of variant-level genotype information, so it can be computed from single-sample gVCFs or from callsets in VCF or BCF format, as well as from efficiently stored variant calls in Hail VDS format. Our results demonstrate that CHARR accurately recapitulates results from existing tools at substantially reduced cost, improving the accuracy and efficiency of downstream analyses of ultra-large whole genome and exome sequencing datasets.
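To make the idea of reference reads "infiltrating" homozygous alternate calls concrete, the following is a minimal Python sketch and not the authors' implementation. It rests on assumptions that are not stated in the abstract: a contaminating read carries the reference allele with probability equal to the population reference allele frequency, so the reference read fraction at each homozygous alternate site is divided by that frequency before averaging. The names `HomAltCall` and `charr_like_estimate`, and the depth and frequency thresholds, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class HomAltCall:
    """One homozygous alternate genotype call (hypothetical container)."""
    ref_reads: int   # reads supporting the reference allele (e.g. AD[0])
    alt_reads: int   # reads supporting the alternate allele (e.g. AD[1])
    ref_af: float    # population reference allele frequency (assumed known)


def charr_like_estimate(
    calls: Iterable[HomAltCall],
    min_depth: int = 10,        # illustrative threshold, an assumption
    min_ref_af: float = 0.05,   # illustrative threshold, an assumption
    max_ref_af: float = 0.95,   # illustrative threshold, an assumption
) -> Optional[float]:
    """Average frequency-normalized reference read fraction over hom-alt calls.

    Returns None when no call passes the filters.
    """
    total = 0.0
    n_used = 0
    for call in calls:
        depth = call.ref_reads + call.alt_reads
        # Skip shallow sites, where the read fraction is too noisy.
        if depth < min_depth:
            continue
        # Skip sites where the reference allele is very rare or very common.
        if not (min_ref_af <= call.ref_af <= max_ref_af):
            continue
        # Assumed model: expected ref fraction ~ contamination * ref_af,
        # so dividing by ref_af gives a per-site contamination estimate.
        total += call.ref_reads / (depth * call.ref_af)
        n_used += 1
    return total / n_used if n_used else None


if __name__ == "__main__":
    example = [
        HomAltCall(ref_reads=1, alt_reads=99, ref_af=0.5),
        HomAltCall(ref_reads=0, alt_reads=50, ref_af=0.5),
        HomAltCall(ref_reads=2, alt_reads=98, ref_af=0.5),
    ]
    print(charr_like_estimate(example))
```

Because the sketch needs only the genotype, the per-allele read depths, and an allele frequency at each site, it illustrates why an estimate of this kind can be computed from variant-level files rather than from BAM/CRAM reads.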
Publisher
Cold Spring Harbor Laboratory