Abstract
To reduce the cost of experimentally characterizing potential substrates for enzymes, machine learning prediction models offer an alternative solution. Pretrained language models, as powerful approaches for protein and molecule representation, have been employed in the development of enzyme-substrate prediction models, achieving promising performance. In addition to continuing improvements in language models, effectively fusing encoders to handle multimodal prediction tasks is critical for further enhancing model performance using available representation methods. Here, we present CLR_ESP, a multimodal classifier that integrates protein and chemistry language models with a newly designed contrastive learning strategy for predicting enzyme-substrate pairs. Our best model achieved state-of-the-art performance with an accuracy of 94.70% on independent test data while requiring fewer computational resources and less training data. It also confirmed our hypothesis that embeddings of positive pairs lie closer to each other in high-dimensional space, while negative pairs exhibit the opposite trend. The proposed architecture is expected to be further applied to enhance performance in additional multimodal prediction tasks in biology. A user-friendly web server for CLR_ESP has been established and is freely accessible at https://78k6imn5wp.us-east-1.awsapprunner.com/.
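The abstract's hypothesis, that embeddings of positive (enzyme, substrate) pairs sit closer together than those of negative pairs, is the standard intuition behind contrastive objectives. The sketch below is a generic margin-based contrastive loss on a single embedding pair, not the authors' actual CLR_ESP objective (which the abstract does not specify); the function name, margin value, and cosine-similarity choice are illustrative assumptions.

```python
import numpy as np

def contrastive_pair_loss(protein_emb, mol_emb, label, margin=0.5):
    """Toy margin-based contrastive loss for one (protein, molecule) pair.

    label=1 (true substrate pair): penalize low cosine similarity,
    pulling the two embeddings together in the shared space.
    label=0 (negative pair): penalize similarity above the margin,
    pushing the embeddings apart.
    """
    p = protein_emb / np.linalg.norm(protein_emb)
    m = mol_emb / np.linalg.norm(mol_emb)
    sim = float(np.dot(p, m))  # cosine similarity in [-1, 1]
    if label == 1:
        return 1.0 - sim                # zero loss only at sim = 1
    return max(0.0, sim - margin)       # zero loss once sim <= margin

# Two identical embedding directions: ideal for a positive pair,
# maximally penalized for a negative pair.
a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])
pos_loss = contrastive_pair_loss(a, b, label=1)
neg_loss = contrastive_pair_loss(a, b, label=0)
```

Training on such a loss over many labeled pairs is what drives positive pairs together and negative pairs apart, producing exactly the geometric separation the abstract reports observing.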
Publisher
Cold Spring Harbor Laboratory