Generative design of compounds with desired potency from target protein sequences using a multimodal biochemical language model-Reference-Cited by-同舟云学术

Generative design of compounds with desired potency from target protein sequences using a multimodal biochemical language model

Published:2024-05-22 Issue:1 Volume:16 Page:
ISSN:1758-2946
Container-title:Journal of Cheminformatics
language:en
Short-container-title:J Cheminform

Author:

Chen Hengwei,Bajorath Jürgen

Abstract

Abstract Deep learning models adapted from natural language processing offer new opportunities for the prediction of active compounds via machine translation of sequential molecular data representations. For example, chemical language models are often derived for compound string transformation. Moreover, given the principal versatility of language models for translating different types of textual representations, off-the-beaten-path design tasks might be explored. In this work, we have investigated generative design of active compounds with desired potency from target sequence embeddings, representing a rather provoking prediction task. Therefore, a dual-component conditional language model was designed for learning from multimodal data. It comprised a protein language model component for generating target sequence embeddings and a conditional transformer for predicting new active compounds with desired potency. To this end, the designated “biochemical” language model was trained to learn mappings of combined protein sequence and compound potency value embeddings to corresponding compounds, fine-tuned on individual activity classes not encountered during model derivation, and evaluated on compound test sets that were structurally distinct from training sets. The biochemical language model correctly reproduced known compounds with different potency for all activity classes, providing proof-of-concept for the approach. Furthermore, the conditional model consistently reproduced larger numbers of known compounds as well as more potent compounds than an unconditional model, revealing a substantial effect of potency conditioning. The biochemical language model also generated structurally diverse candidate compounds departing from both fine-tuning and test compounds. Overall, generative compound design based on potency value-conditioned target sequence embeddings yielded promising results, rendering the approach attractive for further exploration and practical applications. Scientific contribution The approach introduced herein combines protein language model and chemical language model components, representing an advanced architecture, and is the first methodology for predicting compounds with desired potency from conditioned protein sequence data.

Funder

China Scholarship Council

Rheinische Friedrich-Wilhelms-Universität Bonn

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s13321-024-00852-x.pdf

Reference43 articles.

1. Keserü GM, Makara GM (2009) The influence of lead discovery strategies on the properties of drug candidates. Nat Rev Drug Discov 8:203–212. https://doi.org/10.1038/nrd2796

2. Ferreira LLG, Andricopulo AD (2019) ADMET modeling approaches in Drug Discovery. Drug Discov Today 24:1157–1165. https://doi.org/10.1016/j.drudis.2019.03.015

3. Lewis RA, Wood D (2014) Modern 2D QSAR for drug discovery. WIREs Comput Mol Sci 4:505–522. https://doi.org/10.1002/wcms.1187

4. Muratov EN, Bajorath J, Sheridan RP et al (2020) QSAR without borders. Chem Soc Rev 49:3525–3564. https://doi.org/10.1039/d0cs00098a

5. Vamathevan J, Clark D, Czodrowski P et al (2019) Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 18:463–477. https://doi.org/10.1038/s41573-019-0024-5