Abstract
O-GlcNAcylation has the potential to be an important target for therapeutics, but no motif or algorithm that reliably predicts O-GlcNAcylation sites is available. Despite the importance of O-GlcNAcylation, current predictive models are insufficient: they fail to generalize, and many are no longer available. This article constructs MLP and RNN models to predict the presence of O-GlcNAcylation sites from protein sequences. Multiple datasets are evaluated separately and assessed in terms of their strengths and issues. The models trained in this work achieve considerably better metrics than previously published models, with at least a two-fold increase in F1 score; the specific gains vary depending on the dataset. Within a given dataset, the results are robust to changes in cross-validation and test data, as determined by nested validation. The best model achieves an F1 score of 36% (more than 3.5-fold greater than the previous best model) and a Matthews Correlation Coefficient of 35% (more than 4.5-fold greater than the previous best model); its F1 score is also 7.6-fold higher than that obtained without any model. Shapley values are used to interpret the model's predictions and provide biological insight into O-GlcNAcylation.
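For readers less familiar with the two metrics reported above, the following is a minimal sketch of how the F1 score and the Matthews Correlation Coefficient are computed from confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from the article. The `all_positive_f1` helper assumes that "not using any model" means labelling every candidate site as positive, which is one common no-model baseline; the abstract does not state which baseline the authors used.

```python
import math


def f1_and_mcc(tp, fp, fn, tn):
    """Return (F1, MCC) from confusion-matrix counts.

    tp, fp, fn, tn: true positives, false positives,
    false negatives, true negatives.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # MCC is conventionally set to 0 when any marginal total is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


def all_positive_f1(prevalence):
    """F1 of a trivial classifier that labels every site positive.

    Assumed baseline (not confirmed by the source): recall is 1 and
    precision equals the fraction of sites that are truly positive.
    """
    return 2 * prevalence / (1 + prevalence)


# Hypothetical counts for an imbalanced dataset (illustration only).
f1, mcc = f1_and_mcc(tp=30, fp=70, fn=20, tn=880)
print(f"F1 = {f1:.3f}, MCC = {mcc:.3f}")
print(f"All-positive F1 at 5% prevalence = {all_positive_f1(0.05):.3f}")
```

Because O-GlcNAcylated sites are rare relative to unmodified sites, both metrics are more informative than accuracy here: a classifier that never predicts a site can reach high accuracy while scoring zero on F1 and MCC.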
Publisher
Cold Spring Harbor Laboratory