Abstract
Motivation: Multi-drug resistant or hetero-resistant tuberculosis (TB) hinders successful treatment of the disease. Hetero-resistant TB occurs when multiple strains of the TB-causing bacterium with varying degrees of drug susceptibility are present in an individual. Existing studies that predict the proportion and identity of strains in a mixed infection sample rely on a reference database of known strains. A main challenge, then, is to identify de novo strains not present in the reference database while quantifying the proportion of known strains.
Results: We present Demixer, a probabilistic generative model that uses a combination of reference-based and reference-free techniques to delineate mixed infection strains in whole genome sequencing (WGS) data. Demixer extends a topic model widely used in text mining to represent known mutations and discover novel ones. Parallelization and other heuristics enabled Demixer to process large datasets like CRyPTIC (Comprehensive Resistance Prediction for Tuberculosis: an International Consortium). In both synthetic and experimental benchmark datasets, our proposed method precisely detected the identity (e.g., 91.67% accuracy on the experimental in vitro dataset) as well as the proportions of the mixed strains. In real-world applications, Demixer revealed novel high-confidence mixed infections (101 out of 1,963 Malawi samples analyzed) and new insights into the global frequency of mixed infection (2% at the most stringent threshold in the CRyPTIC dataset) and its significant association with drug resistance. Our approach is generalizable and hence applicable to any bacterial and viral WGS data.
Availability: All code relevant to Demixer is available at https://github.com/BIRDSgroup/Demixer.
Contact: nmanik@cse.iitm.ac.in
Supplementary information: The Supplemental Data/Result Files related to Demixer are available at this link: https://drive.google.com/drive/folders/13WFACrn2EpeVTO7533-YwlAGjgF4UH3k?usp=drive_link.
Publisher
Cold Spring Harbor Laboratory