Standardizing chemical compounds with language models

Published:2023-08-08 Issue:3 Volume:4 Page:035014
ISSN:2632-2153
Container-title:Machine Learning: Science and Technology
Short-container-title:Mach. Learn.: Sci. Technol.

Author:

Cretu Miruna T, Toniato Alessandra, Thakkar Amol, Debabeche Amin A, Laino Teodoro, Vaucher Alain C

Abstract

With the growing amount of chemical data stored digitally, it has become crucial to represent chemical compounds accurately and consistently. Harmonized representations facilitate the extraction of insightful information from datasets, and are advantageous for machine learning applications. To achieve consistent representations throughout datasets, one relies on molecule standardization, which is typically accomplished using rule-based algorithms that modify descriptions of functional groups. Here, we present the first deep-learning model for molecular standardization. We enable custom standardization schemes based solely on data, which, as an additional benefit, support standardization options that are difficult to encode into rules. Our model achieves over 98% accuracy in learning two popular rule-based standardization protocols. We then follow a transfer learning approach to standardize metal-organic compounds (for which there is currently no automated standardization practice), based on a human-curated dataset of 1512 compounds. This model predicts the expected standardized molecular format with a test accuracy of 80.7%. As standardization can be considered, more broadly, a transformation from undesired to desired representations of compounds, the same data-driven architecture can be applied to other tasks. For instance, we demonstrate the application to compound canonicalization and to the determination of major tautomers in solution, based on computed and experimental data.

Funder

Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung

Publisher

IOP Publishing

Subject

Artificial Intelligence,Human-Computer Interaction,Software

Link

https://iopscience.iop.org/article/10.1088/2632-2153/ace878/pdf

References (44 articles)

1. Neural-symbolic machine learning for retrosynthesis and reaction prediction;Segler;Chem. Eur. J.,2017

2. A graph-convolutional neural network model for the prediction of chemical reactivity;Coley;Chem. Sci.,2019

3. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction;Schwaller;ACS Cent. Sci.,2019

4. Planning chemical syntheses with deep neural networks and symbolic AI;Segler;Nature,2018

5. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy;Schwaller;Chem. Sci.,2020