Author:
Zhu Kaiyuan,Robinson Welles,Schäffer Alejandro A.,Xu Junyan,Ruppin Eytan,Ergun A. Funda,Ye Yuzhen,Sahinalp S. Cenk
Abstract
AbstractThe identification and quantification of microbial abundance at the species or strain level from sequencing data is crucial for our understanding of human health and disease. Existing approaches for microbial abundance estimation either use accurate but computationally expensive alignment-based approaches for species-level estimation or less accurate but computationally fast alignment-free approaches that fail to classify many reads accurately at the species or strain-level.Here we introduce CAMMiQ, a novel combinatorial solution to the microbial identification and abundance estimation problem, which performs better than the best used tools on simulated and real datasets with respect to the number of correctly classified reads (i.e., specificity) by an order of magnitude and resolves possible mixtures of similar genomes.As we demonstrate, CAMMiQ can better distinguish between single cells deliberately infected with distinct Salmonella strains and sequenced using scRNA-seq reads than alternative approaches. We also demonstrate that CAMMiQ is also more accurate than the best used approaches on a variety of synthetic genomic read data involving some of the most challenging bacterial genomes derived from NCBI RefSeq database; it can distinguish not only distinct species but also closely related strains of bacteria.The key methodological innovation of CAMMiQ is its use of arbitrary length, doubly-unique substrings, i.e. substrings that appear in (exactly) two genomes in the input database, instead of fixed-length, unique substrings. To resolve the ambiguity in the genomic origin of doubly-unique substrings, CAMMiQ employs a combinatorial optimization formulation, which can be solved surprisingly quickly. CAMMiQ’s index consists of a sparsified subset of the shortest unique and doubly-unique substrings of each genome in the database, within a user specified length range and as such it is fairly compact. In short, CAMMiQ offers more accurate genomic identification and abundance estimation than the best used alternatives while using similar computational resources.Availabilityhttps://github.com/algo-cancer/CAMMiQ
Publisher
Cold Spring Harbor Laboratory
Cited by
2 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献