Same-Species Contamination Detection with Variant Calling Information from Next Generation Sequencing

Published: 2019-01-26

Author:

Jiang Tao, Buchkovich Martin, Motsinger-Reif Alison

Abstract

Motivation: Same-species contamination detection is an important quality control step in genetic data analysis. Compared with the widely discussed cross-species contamination, same-species contamination is more challenging to detect, and few methods exist to detect and correct for it. Same-species contamination may come from lab technicians or from samples from other contributors. Here, we introduce a novel machine learning algorithm that uses support vector machines to detect same-species contamination in next generation sequencing data. Our approach uniquely detects such contamination using variant calling information stored in variant call format (VCF) files (from either DNA or RNA) and, importantly, can differentiate between same-species contamination and mixtures of tumor and normal cells.

Methods: In the first stage of our approach, a change-point detection method is used to identify copy number variations or copy number aberrations (CNVs or CNAs), which are filtered out before testing for contamination. Next, single nucleotide polymorphism (SNP) data are used to test for same-species contamination with a support vector machine model. Based on the assumption that alternative allele frequencies in next generation sequencing follow the beta-binomial distribution, the deviation parameter ρ is estimated by maximum likelihood. All features of a radial basis function (RBF) kernel support vector machine (SVM) are generated using either publicly available or private training data. Lastly, the trained SVM is applied to the test data to detect contamination. If training data are not available, a default RBF kernel SVM model is used.

Results: We demonstrate the potential of our approach using simulation experiments, creating datasets with varying levels of contamination. The datasets combine, in silico, exome sequencing data of DNA from two lymphoblastoid cell lines (NA12878 and NA10855). We generated VCF files using variants identified in these data, and then evaluated the power and false positive rate of our approach for detecting same-species contamination. Our simulation experiments show that our method can detect levels of contamination as low as 5% with reasonable false positive rates. Results in real data show sensitivity above 99.99% and specificity of 90.24%, even in the presence of DNA degradation, which has features similar to those of contaminated samples. Additionally, the approach can distinguish mixtures of tumor and normal cells from contamination. We provide an R software implementation of our approach through the defcon() function in the vanquish (Variant Quality Investigation Helper) R package on CRAN.
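The maximum likelihood step described in the Methods can be illustrated with a minimal sketch. It is not the vanquish implementation: the parameterization (mean 0.5 at heterozygous sites, α = μ(1−ρ)/ρ, β = (1−μ)(1−ρ)/ρ), the golden-section optimizer, the simulated read counts, and the function names `betabinom_loglik`, `estimate_rho`, and `simulate_het_sites` are all assumptions made for illustration. The sketch only shows why an admixed sample tends to yield a larger estimated ρ than a clean one.

```python
import math
import random


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_loglik(rho, alt, depth, mu=0.5):
    """Beta-binomial log-likelihood of alt-read counts, up to a constant.

    Assumed parameterization: mean mu, overdispersion rho in (0, 1).
    The binomial coefficient is dropped because it does not depend on rho.
    """
    a = mu * (1.0 - rho) / rho
    b = (1.0 - mu) * (1.0 - rho) / rho
    base = log_beta(a, b)
    return sum(log_beta(k + a, n - k + b) - base for k, n in zip(alt, depth))


def estimate_rho(alt, depth, mu=0.5, lo=1e-4, hi=0.5, tol=1e-5):
    """Maximize the log-likelihood over rho by golden-section search.

    Assumes the likelihood has a single peak on (lo, hi).
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1 = betabinom_loglik(x1, alt, depth, mu)
    f2 = betabinom_loglik(x2, alt, depth, mu)
    while hi - lo > tol:
        if f1 < f2:  # the maximum lies in (x1, hi)
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = betabinom_loglik(x2, alt, depth, mu)
        else:  # the maximum lies in (lo, x2)
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = betabinom_loglik(x1, alt, depth, mu)
    return (lo + hi) / 2.0


def simulate_het_sites(n_sites, depth, rho, contamination=0.0, seed=0):
    """Simulate alt-read counts at heterozygous SNPs (hypothetical toy model).

    With contamination > 0, each site's allele fraction is shifted toward a
    contaminant genotype drawn uniformly from hom-ref, het, and hom-alt.
    """
    rng = random.Random(seed)
    a = 0.5 * (1.0 - rho) / rho  # symmetric Beta(a, a) has mean 0.5
    alt, dep = [], []
    for _ in range(n_sites):
        p = rng.betavariate(a, a)
        if contamination > 0.0:
            g = rng.choice([0.0, 0.5, 1.0])
            p = (1.0 - contamination) * p + contamination * g
        alt.append(sum(1 for _ in range(depth) if rng.random() < p))
        dep.append(depth)
    return alt, dep


if __name__ == "__main__":
    clean = simulate_het_sites(2000, 60, rho=0.01, seed=1)
    mixed = simulate_het_sites(2000, 60, rho=0.01, contamination=0.2, seed=1)
    print("rho (clean):       ", round(estimate_rho(*clean), 4))
    print("rho (contaminated):", round(estimate_rho(*mixed), 4))
```

In a real workflow the `alt` and `depth` lists would come from per-site allele depths in a VCF file after CNV filtering, and the estimated ρ would be one of several inputs to the classifier rather than a decision rule by itself.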

Publisher

Cold Spring Harbor Laboratory

References: 33 articles.

1. Conpair: concordance and contamination estimator for matched tumor–normal pairs

2. A Limited Memory Algorithm for Bound Constrained Optimization; SIAM Journal on Scientific Computing, 1995

3. A uniform survey of allele-specific binding and expression over 1000-Genomes-Project individuals; Nature Communications, 2016

4. ContEst: estimating cross-contamination of human samples in next-generation sequencing data

Cited by 4 articles.

1. A comprehensive performance evaluation, comparison, and integration of computational methods for detecting and estimating cross-contamination of human samples in cancer next-generation sequencing analysis; Journal of Biomedical Informatics; 2024-04

2. Retraction: Teixeira et al. RADseq Data Suggest Occasional Hybridization between Microcebus murinus and M. ravelobensis in Northwestern Madagascar. Genes 2022, 13, 913; Genes; 2022-11-18

3. VCFcontam: A Machine Learning Approach to Estimate Cross-Sample Contamination from Variant Call Data; 2021-03-12

4. read_haps: using read haplotypes to detect same species contamination in DNA sequences; 2020-02-12