Affiliation:
1. Pennsylvania State University
2. The Pennsylvania State University
Abstract
Quantification of enzymatic activities still relies heavily on experimental assays, which can be expensive and time-consuming. Therefore, methods that enable accurate predictions of enzyme activity can serve as effective digital twins. A few recent studies have shown that machine learning (ML) models can be trained on in vitro measurements to predict enzyme turnover numbers (kcat) and Michaelis constants (Km) using only features derived from enzyme sequences and substrate chemical topologies. However, several challenges remain, such as the lack of standardized training datasets, evaluation of predictive performance on out-of-distribution examples, and model uncertainty quantification. Here, we introduce CatPred, a comprehensive framework for ML prediction of in vitro enzyme kinetics. We explored different learning architectures and feature representations for enzymes, including those utilizing pretrained protein language model features and pretrained three-dimensional structural features. We systematically evaluate the performance of trained models for predicting kcat, Km, and inhibition constants (Ki) of enzymatic reactions on held-out test sets, with special emphasis on out-of-distribution test samples (corresponding to enzyme sequences dissimilar from those encountered during training). CatPred adopts a probabilistic regression approach, offering query-specific mean and standard deviation predictions. Results on unseen data confirm that lower predicted variances correlate with higher accuracy in the enzyme parameter predictions made by CatPred. Incorporating pretrained language model features is found to enable robust performance on out-of-distribution samples. Evaluations on both held-out and out-of-distribution test datasets confirm that CatPred performs at least competitively with existing methods while simultaneously offering robust uncertainty quantification.
CatPred offers wider scope and larger data coverage (~23k, 41k, and 12k data points for kcat, Km, and Ki, respectively). A web resource for using the trained models is available at: https://tiny.cc/catpred
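To make the idea of probabilistic regression concrete, the following is a minimal sketch of how a query-specific mean and standard deviation can be scored and how predicted uncertainty can be compared against observed error. It assumes a Gaussian negative log-likelihood, which the abstract does not state is CatPred's actual loss. The function names, the cutoff, and the data values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math


def gaussian_nll(y_true, mean, std):
    """Negative log-likelihood of y_true under a normal N(mean, std**2).

    Assumed scoring rule for illustration; not confirmed as CatPred's loss.
    """
    var = std ** 2
    return 0.5 * (math.log(2 * math.pi * var) + (y_true - mean) ** 2 / var)


def mean_abs_error_by_confidence(records, std_cutoff):
    """Split (y_true, mean, std) records at std_cutoff and return the mean
    absolute error of the confident group and of the uncertain group."""
    confident = [abs(y - m) for y, m, s in records if s <= std_cutoff]
    uncertain = [abs(y - m) for y, m, s in records if s > std_cutoff]

    def average(values):
        return sum(values) / len(values) if values else float("nan")

    return average(confident), average(uncertain)


# Hypothetical log10(kcat) records: (measured, predicted mean, predicted std).
records = [
    (1.0, 1.1, 0.2),
    (2.0, 1.8, 0.3),
    (0.5, 1.5, 1.0),
    (3.0, 1.6, 1.2),
]

low_std_error, high_std_error = mean_abs_error_by_confidence(records, std_cutoff=0.5)
print(f"MAE when predicted std <= 0.5: {low_std_error:.2f}")
print(f"MAE when predicted std >  0.5: {high_std_error:.2f}")
```

In this toy data the low-uncertainty group has the smaller error, which is the kind of relationship the abstract reports between predicted variance and accuracy. A real evaluation would use held-out and out-of-distribution test sets rather than hand-picked values.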
Publisher
Research Square Platform LLC