Feature selection and aggregation for antibiotic resistance GWAS in Mycobacterium tuberculosis: a comparative study

Published: 2022-03-18

Authors:

Reshetnikov K.O., Bykova D.I., Kuleshov K.V., Chukreev K., Guguchkin E.P., Akimkin V.G., Neverov A.D., Fedonin G.G.

Abstract

Drug resistance (DR) remains a global healthcare concern. In contrast to other human bacterial pathogens, the acquisition of mutations in the genome is the main mechanism of drug resistance in Mycobacterium tuberculosis (MTB). For some antibiotics, the resistance of a particular isolate can be predicted with high confidence from the presence of specific mutations, but for others our knowledge of the resistance mechanism is only moderate. Statistical machine learning (ML) methods are used in attempts to infer new genes implicated in drug resistance. These methods use large collections of isolates with known whole-genome sequences and resistance status for different drugs. However, resistance phenotypes for drugs that are used together in one treatment regimen are highly correlated, which complicates the inference of causal mutations by traditional ML. Recently, several new methods were suggested to deal with the problem of correlated response variables in training data. In this study, we applied the following methods to tackle the confounding effect of resistance co-occurrence in a dataset of approximately 13 000 complete genomes of MTB with characterized resistance status for 13 drugs: logistic regression with different regularization penalty functions, a polynomial-time algorithm for the best-subset selection problem (ABESS), and the “Hungry, Hungry SNPos” (HHS) method. We compared these methods by their ability to select known causal mutations for resistance to each particular drug and not to select mutations in genes that are known to be associated with resistance to other drugs. ABESS significantly outperformed the other methods, selecting more relevant sets of mutations. We also showed that aggregating rare mutations into features indicating changes in PFAM domains increased the quality of prediction, and that these features were predominantly selected by ABESS.

Impact statement

Due to the high significance of the problem, many studies in the recent decade have aimed to predict drug susceptibility/resistance of MTB from its genotype. Most such methods were based on prior biological knowledge, e.g. consideration of mutations occurring in known genes involved in the metabolism of drugs. In our study, we estimated to what extent ML methods could extract de novo biologically relevant associations of mutations with resistance/susceptibility to drugs from large datasets of clinical MTB isolates. As a criterion of accuracy, we used the known, experimentally verified associations of mutations in MTB genes with the corresponding drugs. The most accurate of the benchmarked approaches assigned most of these known genes to the correct drugs. The result of feature selection was robust despite the presence of population structure with strong phylogenetic and geographic signals in the dataset. We also designed an original approach for the aggregation of rare mutations and demonstrated that it improved the classification accuracy of ML models. To our knowledge, this study is the first comparison of modern feature selection methods applied to genome-wide association studies (GWAS) of MTB drug resistance.

Data Summary

The dataset unifies characterized whole-genome sequences of M. tuberculosis from multiple studies [1–10]. Short Illumina reads are available in public repositories (SRA or ENA). Sample ids, phenotypes and links to the source papers are summarized and listed in Table S1. The dataset and the source code can be downloaded from the GitHub repository: https://github.com/Reshetnikoff/m.tuberculosis-research-code
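
The aggregation of rare mutations into domain-level features mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that is in the GitHub repository); the gene names, domain coordinates, the `min_count` threshold, and the choice to drop rare mutations that fall outside any domain are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical domain map: gene -> list of (domain_name, first_codon, last_codon).
# These coordinates are placeholders, not real PFAM annotations.
DOMAINS = {
    "geneA": [("domain_1", 1, 100), ("domain_2", 150, 300)],
    "geneB": [("domain_3", 10, 80)],
}


def domain_of(gene, codon):
    """Return a 'gene:domain' label covering the codon, or None if none does."""
    for name, start, end in DOMAINS.get(gene, []):
        if start <= codon <= end:
            return f"{gene}:{name}"
    return None


def aggregate_rare_mutations(isolates, min_count=2):
    """Build a set of binary features for every isolate.

    isolates: dict mapping isolate id -> set of (gene, codon, alt) mutations.
    A mutation seen in at least `min_count` isolates stays an individual
    feature. A rarer mutation inside an annotated domain is replaced by one
    shared "domain changed" feature. A rarer mutation outside any domain is
    dropped (an assumption of this sketch).
    """
    counts = Counter(m for muts in isolates.values() for m in muts)
    features = {}
    for iso, muts in isolates.items():
        feats = set()
        for gene, codon, alt in muts:
            if counts[(gene, codon, alt)] >= min_count:
                feats.add(f"{gene}:{codon}{alt}")
            else:
                dom = domain_of(gene, codon)
                if dom is not None:
                    feats.add(f"{dom}:changed")
        features[iso] = feats
    return features


# Toy input: three isolates with made-up mutations.
isolates = {
    "iso1": {("geneA", 50, "L"), ("geneB", 20, "P")},
    "iso2": {("geneA", 50, "L"), ("geneA", 200, "T")},
    "iso3": {("geneA", 210, "G"), ("geneB", 95, "S")},
}
features = aggregate_rare_mutations(isolates)
for iso in sorted(features):
    print(iso, sorted(features[iso]))
```

In this toy input, the two distinct rare mutations in `geneA` carried by `iso2` and `iso3` collapse into the same `geneA:domain_2:changed` feature, so a downstream classifier sees one feature present in two isolates instead of two singletons.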

Publisher

Cold Spring Harbor Laboratory

References: 80 articles.

1. Genome sequencing of 161 Mycobacterium tuberculosis isolates from China identifies genes and intergenic regions associated with drug resistance

2. Genomic analysis identifies targets of convergent positive selection in drug-resistant Mycobacterium tuberculosis

3. Genome-wide analysis of multi- and extensively drug-resistant Mycobacterium tuberculosis

4. Transmission of multidrug-resistant Mycobacterium tuberculosis in Shanghai, China: a retrospective observational study using whole-genome sequencing and epidemiological investigation

5. GWAS for quantitative resistance phenotypes in Mycobacterium tuberculosis reveals resistance genes and regulatory regions

Cited by 1 article.

1. Machine learning models for Neisseria gonorrhoeae antimicrobial susceptibility tests. Annals of the New York Academy of Sciences, 2022-12-27