Comprehensive and accurate genetic variant identification from contaminated and low coverage Mycobacterium tuberculosis whole genome sequencing data-Reference-Cited by-同舟云学术

Comprehensive and accurate genetic variant identification from contaminated and low coverage Mycobacterium tuberculosis whole genome sequencing data

Published:2021-09-16 Issue: Volume: Page:
ISSN:
Container-title:
language:
Short-container-title:

Author:

Heupink Tim H.^ORCID,Verboven Lennert,Warren Robin M.,Van Rie Annelies

Abstract

AbstractImproved understanding of the genomic variants that allow Mycobacterium tuberculosis (Mtb) to acquire drug resistance, or tolerance, and increase its virulence are important factors in controlling the current tuberculosis epidemic. Current approaches to Mtb sequencing however cannot reveal Mtb’s full genomic diversity due to the strict requirements of low contamination levels, high Mtb sequence coverage, and elimination of complex regions.We developed the XBS (compleX Bacterial Samples) bioinformatics pipeline which implements joint calling and machine-learning-based variant filtering tools to specifically improve variant detection in the important Mtb samples that do not meet these criteria, such as those from unbiased sputum samples. Using novel simulated datasets, that permit exact accuracy verification, XBS was compared to the UVP and MTBseq pipelines. Accuracy statistics showed that all three pipelines performed equally well for sequence data that resemble those obtained from high depth coverage and low-level contamination culture isolates. In the complex genomic regions however, XBS accurately identified 9.0% more single nucleotide polymorphisms and 8.1% more single nucleotide insertions and deletions than the WHO-endorsed unified analysis variant pipeline. XBS also had superior accuracy for sequence data that resemble those obtained directly from sputum samples, where depth of coverage is typically very low and contamination levels are high. XBS was the only pipeline not affected by low depth of coverage (5-10×), type of contamination and excessive contamination levels (>50%). Simulation results were confirmed using WGS data from clinical samples, confirming the superior performance of XBS with a higher sensitivity (98.8%) when analysing culture isolates and identification of 13.9% more variable sites in WGS data from sputum samples as compared to MTBseq, without evidence for false positive variants when ribosomal RNA regions were excluded.The XBS pipeline facilitates sequencing of less-than-perfect Mtb samples. These advances will benefit future clinical applications of Mtb sequencing, especially whole genome sequencing directly from clinical specimens, thereby avoiding in vitro biases and making many more samples available for drug resistance and other genomic analyses. The additional genetic resolution and increased sample success rate will improve genome-wide association studies and sequence-based transmission studies.Impact statementMycobacterium tuberculosis (Mtb) DNA is usually extracted from culture isolates to obtain high quantities of non-contaminated DNA but this process can change the make-up of the bacterial population and is time-consuming. Furthermore, current analytic approaches exclude complex genomic regions where DNA sequences are repeated to avoid inference of false positive genetic variants, which may result in the loss of important genetic information.We designed the compleX Bacterial Sample (XBS) variant caller to overcome these limitations. XBS employs joint variant calling and machine-learning-based variant filtering to ensure that high quality variants can be inferred from low coverage and highly contaminated genomic sequence data obtained directly from sputum samples. Simulation and clinical data analyses showed that XBS performs better than other pipelines as it can identify more genetic variants and can handle complex (low depth, highly contaminated) Mtb samples. The XBS pipeline was designed to analyse Mtb samples but can easily be adapted to analyse other complex bacterial samples.Data summarySimulated sequencing data have been deposited in SRA BioProject PRJNA706121. All detailed findings are available in the Supplementary Material. Scripts for running the XBS variant calling core are available on https://github.com/TimHHH/XBS The authors confirm all supporting data, code and protocols have been provided within the article or through supplementary data files.

Publisher

Cold Spring Harbor Laboratory

Reference28 articles.

1. Whole genome sequencing of Mycobacterium tuberculosis: current standards and open issues;Nat. Rev. Microbiol,2019

2. Identification and characterization of breakthrough contaminants associated with the conventional isolation of Mycobacterium tuberculosis

3. The impact of repeated NALC/NaOH-decontamination on the performance of Xpert MTB/RIF assay;Tuberculosis,2018

4. Bias in detection of Mycobacterium tuberculosis polyclonal infection: Use clinical samples or culturesã;Mol. Cell. Probes,2017

5. Whole genome sequencing Mycobacterium tuberculosis directly from sputum identifies more genetic diversity than sequencing from culture

Cited by 1 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Variants in Bedaquiline-Candidate-Resistance Genes: Prevalence in Bedaquiline-Naive Patients, Effect on MIC, and Association with Mycobacterium tuberculosis Lineage;Antimicrobial Agents and Chemotherapy;2022-07-19