Affiliation:
1. College of Automation, Hangzhou Dianzi University, Hangzhou, Zhejiang, 310018, PR China
2. Institute of Biomedical Engineering, Hangzhou Dianzi University, Hangzhou, Zhejiang, 310018, PR China
3. NMPA Key Laboratory for Testing and Risk Warning of Pharmaceutical Microbiology, Hangzhou, Zhejiang, 310012, PR China
4. Key Laboratory of Microorganism Technology and Bioinformatics Research of Zhejiang Province, Hangzhou, Zhejiang, 310012, PR China
Abstract
Introduction. Klebsiella pneumoniae, a gram-negative bacterium, is a common pathogen causing nosocomial infection. The drug-resistance rate of K. pneumoniae is increasing year by year, posing a severe threat to public health worldwide. K. pneumoniae has been listed as one of the pathogens causing the global crisis of antimicrobial resistance in nosocomial infections. The drug resistance of K. pneumoniae therefore needs to be explored to support clinical diagnosis. Single nucleotide polymorphisms (SNPs) occur at high density in whole-genome sequencing (WGS) data, carry rich genetic information, and can affect the structure or expression of proteins. SNPs can be used to explore mutation sites associated with bacterial resistance.
Hypothesis/Gap Statement. Machine learning methods can detect genetic features associated with the drug resistance of K. pneumoniae from whole-genome SNP data.
Aims. This work used Fast Feature Selection (FFS) and Codon Mutation Detection (CMD) machine learning methods to detect genetic features related to drug resistance of K. pneumoniae from whole-genome SNP data.
Methods. WGS data on the resistance of K. pneumoniae strains to four antibiotics (tetracycline, gentamicin, imipenem, amikacin) were downloaded from the European Nucleotide Archive (ENA). SNPs were called from sequence alignments performed with MUMmer 3, using the K. pneumoniae HS11286 chromosome as the reference genome. The FFS algorithm was applied for feature selection on the SNP dataset. The training set was constructed from mutation sites with mutation frequency >0.995. From the original SNP training set, 70% of SNPs were randomly selected from each dataset as the test set to verify the accuracy of the training results. Finally, resistance genes were identified with the CMD algorithm and Venny.
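The site filtering and random subsampling steps described in the Methods can be illustrated with a minimal sketch. The abstract does not define how mutation frequency is computed, so the definition used here (fraction of strains carrying the non-reference allele at a site) is an assumption, as are the function names `site_frequency`, `filter_sites` and `sample_sites` and the 0/1 encoding of the SNP matrix. This is not the authors' FFS implementation.

```python
import random


def site_frequency(column):
    """Fraction of strains carrying the non-reference allele (coded 1).

    Assumption: this is only one possible reading of 'mutation frequency'.
    """
    return sum(column) / len(column)


def filter_sites(snp_matrix, threshold):
    """Return indices of SNP sites whose frequency exceeds the threshold.

    snp_matrix is a list of rows (one per strain) of 0/1 values.
    """
    n_sites = len(snp_matrix[0])
    keep = []
    for j in range(n_sites):
        column = [row[j] for row in snp_matrix]
        if site_frequency(column) > threshold:
            keep.append(j)
    return keep


def sample_sites(site_indices, fraction=0.7, seed=0):
    """Randomly select a fraction of the sites, reproducibly via the seed."""
    rng = random.Random(seed)
    k = round(len(site_indices) * fraction)
    return sorted(rng.sample(site_indices, k))


# Toy example: 4 strains x 3 sites (hypothetical data).
toy = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
]
kept = filter_sites(toy, threshold=0.5)
subset = sample_sites(list(range(10)), fraction=0.7, seed=42)
```

Fixing the seed keeps the 70% subset reproducible, which matters when the genes predicted from the subset are later compared with those predicted from the full set.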
Results. The numbers of strains resistant to tetracycline, gentamicin, imipenem and amikacin were 931, 1048, 789 and 203, respectively. Machine learning algorithms were applied to the SNP training set and test set, and 28 and 23 resistance genes were predicted, respectively. Of the genes predicted from the test set, 22 were also among the 28 genes predicted from the training set, which verified the accuracy of gene prediction. Among them, some genes (KPHS_35310, KPHS_18220, KPHS_35880, etc.) corresponded to known resistance genes (Eef2, lpxK, MdtC, etc.). Logistic regression classifiers were established based on the identified SNPs in the training set. The areas under the curve (AUCs) for the four antibiotics were 0.939, 0.950, 0.912 and 0.935, respectively, showing a strong ability to predict bacterial resistance.
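As an illustration of the kind of classifier and metric reported in the Results, the following is a minimal sketch of logistic regression fitted by gradient descent on a binary SNP matrix, with the AUC computed as the fraction of correctly ranked resistant/susceptible pairs. The toy data, the hyperparameters and the function names are assumptions made for illustration; the abstract does not specify the authors' implementation or settings.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(X, w, b):
    """Predicted probability of resistance for each strain."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def auc(y_true, scores):
    """AUC as the share of (resistant, susceptible) pairs ranked correctly.

    Ties between a resistant and a susceptible score count as half.
    """
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): rows are strains, columns are selected SNPs,
# labels are 1 = resistant, 0 = susceptible.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
w, b = fit_logistic(X, y)
toy_auc = auc(y, predict_proba(X, w, b))
```

In practice the AUC would be computed on strains held out from fitting; the toy example scores the training strains only to keep the sketch short.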
Conclusion. Machine learning methods can effectively be used to predict resistance genes and associated SNPs. The FFS and CMD algorithms have wide applicability. They can be used for the drug-resistance analysis of any microorganism with genomic variation and phenotypic data. This work lays a foundation for resistance research in clinical applications.
Subject
Microbiology (medical), General Medicine, Microbiology
Cited by
2 articles.