Abstract
Motivation
Protein phosphorylation is a key post-translational modification that plays a central role in many cellular processes. With recent advances in biotechnology, thousands of phosphorylated sites can be identified and quantified in a given sample, enabling proteome-wide screening of cellular signaling. However, the kinase(s) that phosphorylate most (>90%) of the identified phosphorylation sites are unknown. Knowledge of kinase-substrate associations (KSAs) is also mostly limited to a small number of well-studied kinases, with 20% of known kinases accounting for the phosphorylation of 87% of currently annotated sites. The scarcity of available annotations calls for the development of computational algorithms for more comprehensive and reliable prediction of kinase-substrate associations.
Results
To broadly utilize available structural, functional, evolutionary, and contextual information in predicting kinase-substrate associations, we develop a network-based machine learning framework. Our framework integrates a multitude of data sources to characterize the landscape of functional relationships and associations among phosphosites and kinases. To construct a phosphosite-phosphosite association network, we use sequence similarity, shared biological pathways, co-evolution, co-occurrence, and co-phosphorylation of phosphosites across different biological states. To construct a kinase-kinase association network, we integrate protein-protein interactions, shared biological pathways, and membership in common kinase families. We use node embeddings computed from these heterogeneous networks to train machine learning models for predicting kinase-substrate associations. Our systematic computational experiments using the PhosphositePLUS database show that the resulting algorithm, NetKSA, outperforms state-of-the-art algorithms and resources, including KinomeXplorer and LinkPhinder, in reliably predicting KSAs. By stratifying the ranking of kinases, NetKSA also enables annotation of phosphosites that are targeted by relatively less-studied kinases. Finally, we observe that the performance of NetKSA is robust to the choice of network embedding algorithm, while each type of network contributes valuable information that is complementary to the information provided by the other networks.
Conclusion
Representation of available functional information on kinases and phosphorylation sites, along with integrative machine learning algorithms, has the potential to significantly enhance our knowledge of kinase-substrate associations.
Availability
The code and data are available at compbio.case.edu/NetKSA.
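Illustrative sketch
The general idea summarized in the Results (embed the nodes of a phosphosite network and a kinase network, then train a model on site-kinase pairs) can be illustrated with a small example. This is a minimal sketch under stated assumptions and not the NetKSA implementation: the toy networks and labels are invented, a simple spectral embedding and a logistic regression stand in for whichever embedding algorithm and classifier the authors use, the outer-product pair features are one possible choice, and unlabeled pairs are treated as negatives purely for illustration.

```python
import numpy as np

def spectral_embedding(adj, dim):
    """Embed nodes with the top `dim` eigenvectors of the normalized adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))  # isolated nodes keep all-zero rows
    norm = adj * inv_sqrt[:, None] * inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(norm)                 # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:dim]              # keep the largest eigenvalues
    return vecs[:, order]

def pair_features(site_vec, kinase_vec):
    """Hypothetical pair representation: flattened outer product of the two embeddings."""
    return np.outer(site_vec, kinase_vec).ravel()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Plain gradient-descent logistic regression (stand-in classifier)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        grad = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy, invented association networks: 4 phosphosites and 3 kinases.
site_adj = np.array([[0.0, 1.0, 0.2, 0.0],
                     [1.0, 0.0, 0.0, 0.1],
                     [0.2, 0.0, 0.0, 1.0],
                     [0.0, 0.1, 1.0, 0.0]])
kin_adj = np.array([[0.0, 1.0, 0.1],
                    [1.0, 0.0, 0.3],
                    [0.1, 0.3, 0.0]])

site_emb = spectral_embedding(site_adj, dim=2)
kin_emb = spectral_embedding(kin_adj, dim=2)

# Invented "known" associations as (site index, kinase index) pairs.
known = {(0, 0), (2, 2)}
pairs = [(s, k) for s in range(4) for k in range(3)]
X = np.array([pair_features(site_emb[s], kin_emb[k]) for s, k in pairs])
y = np.array([1.0 if p in known else 0.0 for p in pairs])  # unlabeled treated as negative

w, b = train_logistic(X, y)
scores = sigmoid(X @ w + b)

# Rank all kinases for phosphosite 1, which has no known kinase in the toy labels.
site_scores = np.array([scores[pairs.index((1, k))] for k in range(3)])
ranking = np.argsort(-site_scores)
print("kinase ranking for site 1:", ranking.tolist())
```

Swapping `spectral_embedding` for another node embedding method leaves the rest of the sketch unchanged, which is the kind of comparison implied by the robustness observation in the Results.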
Publisher
Cold Spring Harbor Laboratory