Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI

Author:

O’Boyle Noel M

Abstract

Background: There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string.

Results: I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 million compounds in the ChEMBL database, and a 1 million compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database was canonicalised successfully and 99.77% of the PubChem subset.

Conclusions: The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain, such as the development of a standard aromatic model for SMILES, the ability to create the same SMILES using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.
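The general idea of deriving a canonical string from canonical atom labels can be illustrated with a toy sketch. This is not the paper's procedure and does not call the InChI library: the function name `smiles_from_labels` is hypothetical, the labels are supplied by hand as a stand-in for InChI canonical labels, and only acyclic molecules with single bonds are handled (no rings, charges, aromaticity, or stereochemistry). It shows only that, once every atom has a canonical label, a traversal that starts at the lowest label and visits neighbours in label order yields the same string regardless of the input atom order.

```python
def smiles_from_labels(atoms, bonds, labels):
    """Write a SMILES-like string for an acyclic, single-bonded molecule.

    atoms  -- list of element symbols, indexed by atom number
    bonds  -- list of (i, j) atom index pairs
    labels -- canonical rank per atom (hand-supplied here; assumed to
              stand in for labels from a canonicalisation algorithm)
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def visit(atom, parent):
        # Visit unvisited neighbours in order of increasing canonical label.
        children = sorted(
            (n for n in neighbours[atom] if n != parent),
            key=lambda n: labels[n],
        )
        out = atoms[atom]
        # All but the last child are written as parenthesised branches.
        for child in children[:-1]:
            out += "(" + visit(child, atom) + ")"
        if children:
            out += visit(children[-1], atom)
        return out

    # Start from the atom with the lowest canonical label.
    start = min(range(len(atoms)), key=lambda i: labels[i])
    return visit(start, None)


# Propan-2-ol entered with two different atom orderings.
# The labels describe the same ranking in both cases:
# methyl carbons 1 and 2, central carbon 3, oxygen 4.
order_a = smiles_from_labels(
    ["C", "C", "C", "O"], [(0, 1), (1, 2), (1, 3)], [1, 3, 2, 4]
)
order_b = smiles_from_labels(
    ["O", "C", "C", "C"], [(0, 1), (1, 2), (1, 3)], [4, 3, 1, 2]
)
print(order_a, order_b)
```

Both calls produce the same string because the traversal depends only on the labels, not on the order in which the atoms were entered.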

Publisher

Springer Science and Business Media LLC

Subject

Library and Information Sciences, Computer Graphics and Computer-Aided Design, Physical and Theoretical Chemistry, Computer Science Applications

