Symbiont-Screener: a reference-free filter to automatically separate host sequences and contaminants for long reads or co-barcoded reads by unsupervised clustering

Published: 2020-10-26

Author:

Xu Mengyang, Guo Lidong, Shi Chengcheng, Liu Xiaochuan, Chen Jianwei, Liu Xin, Fan Guangyi

Abstract

Decontamination is necessary to eliminate the effect of foreign genomes on symbiont studies and biomedical discoveries. However, directly extracting host sequencing reads without references remains challenging. Here, we present a trio-based method to classify host error-prone long reads or sparse co-barcoded reads prior to assembly, free of any alignment against DNA or protein references. The method first identifies high-confidence host reads using haplotype-specific k-mers inherited from the parents, and then groups the remaining host reads by unsupervised clustering. On artificially mixed data, the approach classified up to 97.38% of host human long reads with a precision of 99.9999%, and 79.95% of host co-barcoded reads with a precision of 98.36%. The tool also performed well when decontaminating real algae data. The purified reads reconstructed two haplotypes and improved the assembly, giving a larger contig NGA50 and fewer misassemblies. Symbiont-Screener can be freely downloaded at https://github.com/BGI-Qingdao/Symbiont-Screener.
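The first step described in the abstract, flagging host reads by parent-inherited k-mers, can be illustrated with a toy sketch. This is not the Symbiont-Screener implementation: the function names, the `min_hits` threshold, and the toy sequences are all hypothetical, reverse complements and sequencing errors are ignored, and the second step (unsupervised clustering of the remaining reads) is not shown.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(paternal_reads, maternal_reads, k):
    """Return k-mers seen in only one parent.

    A simplified stand-in for haplotype-specific k-mers (assumption).
    """
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    return pat - mat, mat - pat


def flag_host_reads(reads, pat_only, mat_only, k, min_hits=1):
    """Label a read 'host' if it carries at least min_hits parent-specific k-mers.

    min_hits is a hypothetical threshold chosen for illustration.
    """
    labels = {}
    for name, seq in reads.items():
        ks = kmers(seq, k)
        hits = len(ks & pat_only) + len(ks & mat_only)
        labels[name] = "host" if hits >= min_hits else "unassigned"
    return labels


# Toy data: the two parents differ at a single base.
pat_only, mat_only = parent_specific_kmers(["ACGTACGTTA"], ["ACGTACGTCA"], k=4)
offspring_reads = {"r1": "TACGTTAG", "r2": "GGGGCCCCGG"}
labels = flag_host_reads(offspring_reads, pat_only, mat_only, k=4)
print(labels)
```

Here `r1` shares k-mers found only in the paternal data and is flagged as host, while `r2` shares none and stays unassigned; in the method described above, such remaining reads would be handled by the clustering step.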

Publisher

Cold Spring Harbor Laboratory
