Abstract
Many pathogenic bacteria use type IV secretion systems (T4SSs) to deliver effectors (T4SEs) into the cytoplasm of eukaryotic cells, causing disease. Identifying these effectors is a crucial step in understanding the mechanisms of bacterial pathogenicity, but it remains a major challenge. In this study, we used the full-length embedding features generated by six pre-trained protein language models to train classifiers that predict T4SEs, and compared their performance. An integrated model, T4SEpp, was assembled from three modules: a module that searches for full-length, signal-sequence and effector-domain homologs of known T4SEs; a machine learning module based on hand-crafted features extracted from the signal sequences; and a module containing the three best-performing pre-trained protein language models. On an independent testing dataset, T4SEpp outperformed other state-of-the-art (SOTA) software tools, achieving ∼0.95 sensitivity at a high specificity of ∼0.99. Additionally, we performed a comprehensive search among 8,761 bacterial species and found 227 species, belonging to 3 phyla and 117 genera, that possess T4SSs. Using T4SEpp, we identified a total of 12,622 plausible T4SEs. Overall, T4SEpp provides a better solution for identifying bacterial T4SEs and facilitates studies of bacterial pathogenicity. T4SEpp is freely accessible at https://bis.zju.edu.cn/T4SEpp.
Publisher
Cold Spring Harbor Laboratory