Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model

Published: 2024-03-20

Author:

Zeng Xiangxiang¹, Zhou Peng¹, Wang Jianmin², Li Chunyan³, Wang Zixu⁴, Liu Yiping¹, Sun Siqi⁵, Lin Jianxin¹, Wang Longyue⁶

Affiliation:

1. Hunan University

2. Yonsei University

3. Yunnan Normal University

4. University of Tsukuba

5. Fudan University

6. Tencent AI Lab

Abstract

While various models and computational tools have been proposed for structure and property analysis of molecules, generating molecules that conform to all desired structures and properties remains a challenge. Here, we introduce a multi-constraint molecular generation large language model, TSMMG, which, akin to a student, incorporates knowledge from various small models and tools, namely, the 'teachers'. To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers', enabling it to generate novel molecules that conform to the descriptions given in various text prompts. We show experimentally that TSMMG performs remarkably well at generating molecules that meet complex property requirements described in natural language across two-, three-, and four-constraint tasks, with an average molecular validity of over 99% and success ratios of 88.08%, 65.27%, and 61.44%, respectively. The model also exhibits adaptability in zero-shot testing, creating molecules that satisfy combinations of properties it has not encountered before. As confirmed through empirical validation, it can comprehend text inputs in various language styles, extending beyond the confines of the outlined prompts. Additionally, the knowledge distillation feature of TSMMG contributes to the continuous enhancement of small models, while the innovative approach to dataset construction effectively addresses the issues of data scarcity and quality. Together, these properties position TSMMG as a promising tool in the domains of drug discovery and materials science.

Publisher

Research Square Platform LLC

References (46 articles)

1. Schwalbe-Koda, Daniel, and Rafael Gómez-Bombarelli. "Generative models for automatic chemical design." Machine Learning Meets Quantum Physics, 2020.

2. Gainza, Pablo, et al. "Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning." Nature Methods, 2020.

3. Wójcikowski, Maciej, et al. "Development of a protein–ligand extended connectivity (PLEC) fingerprint and its application for binding affinity predictions." Bioinformatics, 2019.

4. Mahmoud, Amr H., et al. "Elucidating the multiple roles of hydration for accurate protein-ligand binding prediction via deep learning." Communications Chemistry, 2020.

5. Jones, Derek, et al. "Improved protein–ligand binding affinity prediction with structure-based deep fusion inference." Journal of Chemical Information and Modeling, 2021.