Abstract
Low levels of contamination with other human DNA can have disastrous effects on the accurate identification of somatic variation in tumor samples. Detection of sample contamination is often based on low-frequency variants that indicate whether more than one source of DNA is present. This strategy works with standard DNA samples but can be problematic in solid tumor FFPE samples, where copy number gains and losses across the genome often cause large variations in allele frequency (AF). These variable AFs make detection of contamination challenging. To avoid this problem, we counted microhaplotypes to assess sample contamination. Microhaplotypes are sets of variants on the same sequencing read that can be unambiguously phased. Instead of measuring AF, the number of distinct microhaplotypes is determined. Contamination detection is thus based on fundamental genomic properties, linkage disequilibrium (LD) and the diploid nature of human DNA, rather than on variant frequencies. We optimized microhaplotype panel content and selected 164 SNV sets located in regions already being sequenced within a cancer panel, so contamination detection uses existing sequence data. LD data from the 1000 Genomes Project were used to make the panel ancestry agnostic, providing the same sensitivity for contamination detection with samples from individuals of African, East Asian, and European ancestry. Detection of 1% contamination is possible without a matched normal sample. The methods described here can also be extended to other DNA mixtures, such as forensic and non-invasive prenatal testing samples. The microhaplotype method allows sensitive detection of DNA contamination in FFPE tumor and other samples when deep coverage with Illumina or other high-accuracy NGS is used.
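The counting idea can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes that each read spanning a locus has already been reduced to a phased allele string, and the function names, locus names, read counts, and thresholds (`min_reads`, `min_fraction`) are hypothetical values chosen for illustration. The only principle taken from the text is that a single diploid individual should contribute at most two microhaplotypes per locus, so a well-supported third one suggests a second DNA source.

```python
from collections import Counter

def supported_haplotypes(read_haplotypes, min_reads=3, min_fraction=0.002):
    """Return haplotypes at one locus that pass simple support filters.

    The thresholds are illustrative assumptions meant to screen out
    haplotypes that could be explained by sequencing error alone.
    """
    counts = Counter(read_haplotypes)
    total = sum(counts.values())
    return {
        hap: n
        for hap, n in counts.items()
        if n >= min_reads and n / total >= min_fraction
    }

def flag_contaminated_loci(loci, **thresholds):
    """List loci showing more than two supported haplotypes."""
    flagged = []
    for name, reads in loci.items():
        if len(supported_haplotypes(reads, **thresholds)) > 2:
            flagged.append(name)
    return flagged

# Hypothetical per-read phased alleles at two loci.
loci = {
    # Two major haplotypes, a minor third one, and a single likely error read.
    "mh01": ["AC"] * 500 + ["GT"] * 480 + ["AT"] * 12 + ["GC"] * 1,
    # Two haplotypes in unequal proportion, as a copy number change might produce.
    "mh02": ["TT"] * 600 + ["CA"] * 400,
}

flagged = flag_contaminated_loci(loci)
print(flagged)
```

In this toy input, `mh02` has skewed haplotype proportions but is not flagged, because only the number of distinct haplotypes is examined and not their frequencies.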
Publisher
Cold Spring Harbor Laboratory