Invalid SMILES are beneficial rather than detrimental to chemical language models

Published:2024-03-29 Issue:4 Volume:6 Page:437-448
ISSN:2522-5839
Container-title:Nature Machine Intelligence
language:en
Short-container-title:Nat Mach Intell

Author:

Skinnider, Michael A.

Abstract

Generative machine learning models have attracted intense interest for their ability to sample novel molecules with desired chemical or biological properties. Among these, language models trained on SMILES (Simplified Molecular-Input Line-Entry System) representations have been subject to the most extensive experimental validation and have been widely adopted. However, these models have what is perceived to be a major limitation: some fraction of the SMILES strings that they generate are invalid, meaning that they cannot be decoded to a chemical structure. This perceived shortcoming has motivated a remarkably broad spectrum of work designed to mitigate the generation of invalid SMILES or correct them post hoc. Here I provide causal evidence that the ability to produce invalid outputs is not harmful but is instead beneficial to chemical language models. I show that the generation of invalid outputs provides a self-corrective mechanism that filters low-likelihood samples from the language model output. Conversely, enforcing valid outputs produces structural biases in the generated molecules, impairing distribution learning and limiting generalization to unseen chemical space. Together, these results refute the prevailing assumption that invalid SMILES are a shortcoming of chemical language models and reframe them as a feature, not a bug.

Funder

Ludwig Cancer Research

Publisher

Springer Science and Business Media LLC

Link

https://www.nature.com/articles/s42256-024-00821-x.pdf

Cited by 2 articles.

1. Transcriptionally Conditional Recurrent Neural Network for De Novo Drug Design;Journal of Chemical Information and Modeling;2024-07-25

2. Multi-Objective Molecular Design in Constrained Latent Space;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30