Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

Published:2023-05-29 Issue:1 Volume:15 Page:
ISSN:1758-2946
Container-title:Journal of Cheminformatics
language:en
Short-container-title:J Cheminform

Author:

Ucak Umit V.,Ashyrmamatov Islambek,Lee Juyong

Abstract

Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that traditional SMILES tokenization has a limitation: its tokens fail to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme, which eliminates ambiguities arising from the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degree of token degeneration of various schemes and analyzed their adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were included for qualitative examination. We believe that atom-in-SMILES tokenization has great potential for adoption by a broad range of related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.
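
To make the limitation described in the abstract concrete, the following is a minimal sketch of conventional atom-wise SMILES tokenization using a regular expression of the kind commonly found in chemical language model pipelines. It is not the authors' atom-in-SMILES implementation; the function name `tokenize_smiles` and the exact pattern are assumptions chosen for illustration.

```python
import re
from typing import List

# Atom-wise SMILES pattern (an illustrative assumption, not taken from the paper).
# Bracket atoms are kept whole; two-letter halogens are matched before single letters.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into atom-wise tokens (hypothetical helper)."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not fully tokenize SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Acetic acid: the methyl carbon and the carboxyl carbon both become "C",
    # and the carbonyl and hydroxyl oxygens both become "O".
    print(tokenize_smiles("CC(=O)O"))
```

In this sketch the two carbons of acetic acid receive the identical token `C` even though they sit in different chemical environments, which appears to be the kind of ambiguity the abstract attributes to generic SMILES tokens.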

Funder

Korea Environmental Industry and Technology Institute

Ministry of Education

National Research Foundation of Korea

Publisher

Springer Science and Business Media LLC

Subject

Library and Information Sciences,Computer Graphics and Computer-Aided Design,Physical and Theoretical Chemistry,Computer Science Applications

Link

https://link.springer.com/content/pdf/10.1186/s13321-023-00725-9.pdf
