Author:
Ji Boya, Pi Wending, Zhang Xianglilan, Peng Shaoliang
Abstract
Infectious diseases, particularly bacterial infections, are emerging at an unprecedented rate, posing a serious challenge to public health and the global economy. Different virulence factors (VFs) work in concert to enable pathogenic bacteria to adhere to, reproduce in and damage host cells, and antibiotic resistance genes (ARGs) allow pathogens to evade treatments that would otherwise be curative. To understand the causal relationship between microbiome composition, function and disease, both VFs and ARGs in microbial data must be identified. Most existing computational models cannot identify VFs and ARGs simultaneously, which hinders related research. Best-hit approaches are currently the main tools for identifying VFs and ARGs concurrently, yet they usually have high false-negative rates and are very sensitive to cut-off thresholds. In this work, we propose a hybrid computational framework called HyperVR to predict VFs and ARGs at the same time. Specifically, HyperVR integrates key genetic features and then stacks classical ensemble learning methods and deep learning for training and prediction. HyperVR accurately predicts VFs, ARGs and negative genes (neither VFs nor ARGs) simultaneously, with both high precision (>0.91) and high recall (>0.91). HyperVR also retains the flexibility to predict VFs or ARGs individually. For novel VFs and ARGs, VFs and ARGs in metagenomic data, and pseudo VFs and ARGs (gene fragments), HyperVR predicts well, outperforming the current state-of-the-art prediction tools and best-hit approaches in terms of precision and recall. HyperVR is a powerful tool for predicting VFs and ARGs simultaneously using only gene sequences and without strict cut-off thresholds, making prediction straightforward and accurate.
Publisher
Cold Spring Harbor Laboratory