A bioinformatics pipeline for Mycobacterium tuberculosis sequencing that cleans contaminant reads from sputum samples

Published:2021-10-26 Issue:10 Volume:16 Page:e0258774
ISSN:1932-6203
Container-title:PLOS ONE
language:en
Short-container-title:PLoS ONE

Author:

Cuevas-Córdoba Betzaida, Fresno Cristóbal, Haase-Hernández Joshua I., Barbosa-Amezcua Martín, Mata-Rocha Minerva, Muñoz-Torrico Marcela, Salazar-Lezama Miguel A., Martínez-Orozco José A., Narváez-Díaz Luis A., Salas-Hernández Jorge, González-Covarrubias Vanessa, Soberón Xavier

Abstract

Next-Generation Sequencing (NGS) is widely used to investigate genomic variation. In several studies, the genetic variation of Mycobacterium tuberculosis has been analyzed in sputum samples without previous culture, using target enrichment methodologies for NGS. Alignment programs are generally run with default parameters, and it is assumed that the results contain only Mycobacterium reads. However, variants of the microorganism of interest in clinical samples can be confused with a vast collection of reads from other bacteria, viruses, and human DNA. Currently, there are no standardized pipelines, and cleaning success is never verified because rigorous controls are lacking to identify and remove reads from other sputum microorganisms genetically similar to M. tuberculosis. Therefore, we designed a bioinformatic pipeline to process NGS data from sputum samples, including several filters and quality control points to identify and eliminate non-M. tuberculosis reads and obtain a reliable genetic variant report. Our proposal uses the SURPI software as a taxonomic classifier to filter input sequences and performs a mapping that provides the highest percentage of Mycobacterium reads while minimizing reads from other microorganisms. We then use the filtered sequences to perform variant calling with the GATK software, applying mapping-quality checks, realignment, recalibration, hard-filtering, and post-filtering to increase the reliability of the reported variants. Using default mapping parameters, we identified reads of contaminant bacteria, such as Streptococcus, Rothia, Actinomyces, and Veillonella. Our final mapping strategy allowed a sequence identity of 97.8% between the input reads and the whole M. tuberculosis reference genome H37Rv using a genomic edit distance of three, thus removing 98.8% of the off-target sequences with a loss of 1.7% of Mycobacterium reads. Finally, more than 200 unreliable genetic variants were removed during variant calling, increasing the report's reliability.
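
The edit-distance criterion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the alignments are available as SAM text lines, that the aligner wrote the standard `NM:i` tag (edit distance to the reference), and that a read is kept when it is mapped and its `NM` value is at most three. The function names and the `MAX_EDIT_DISTANCE` constant are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional

# Threshold quoted in the abstract; how the authors applied it is an assumption here.
MAX_EDIT_DISTANCE = 3
UNMAPPED_FLAG = 0x4  # SAM FLAG bit: segment unmapped


def edit_distance(fields: List[str]) -> Optional[int]:
    """Return the NM:i value from a SAM record's optional fields, or None if absent."""
    for field in fields[11:]:  # optional tags start after the 11 mandatory columns
        if field.startswith("NM:i:"):
            return int(field[len("NM:i:"):])
    return None


def filter_by_edit_distance(
    sam_lines: Iterable[str], max_nm: int = MAX_EDIT_DISTANCE
) -> Iterator[str]:
    """Yield header lines plus mapped records whose edit distance is <= max_nm."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines pass through unchanged
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & UNMAPPED_FLAG:
            continue  # drop unmapped reads
        nm = edit_distance(fields)
        if nm is not None and nm <= max_nm:
            yield line
```

Records without an `NM` tag are dropped in this sketch, since their distance to the reference cannot be checked; a real pipeline might handle that case differently.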

Funder

CONACyT – FONCICYT-GACD

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary

References: 36 articles.

1. Mycobacterium tuberculosis—Heterogeneity revealed through whole genome sequencing;C Ford;Tuberculosis,2012

2. Whole genome sequencing: A new paradigm in the surveillance and control of human tuberculosis;SE Hasnain;Tuberculosis,2015

3. Ten years of next-generation sequencing technology;EL van Dijk;Trends in genetics: TIG,2014

Cited by 8 articles.

1. Targeted next-generation sequencing to diagnose drug-resistant tuberculosis: a systematic review and meta-analysis;The Lancet Infectious Diseases;2024-05

2. Pangenome databases improve host removal and mycobacteria classification from clinical metagenomic data;GigaScience;2024

3. Tools for short variant calling and the way to deal with big datasets;Phylogenomics;2024

4. The MAGMA pipeline for comprehensive genomic analyses of clinical Mycobacterium tuberculosis samples;PLOS Computational Biology;2023-11-29

5. The MAGMA pipeline for comprehensive genomic analyses of clinical Mycobacterium tuberculosis samples;2023-10-05