Exploring data‐driven chemical SMILES tokenization approaches to identify key protein–ligand binding moieties

Author:

Temizer Asu Busra (1, 2), Uludoğan Gökçe (3), Özçelik Rıza (3), Koulani Taha (1, 2), Ozkirimli Elif (4), Ulgen Kutlu O. (5), Karali Nilgun (1), Özgür Arzucan (3)

Affiliation:

1. Department of Pharmaceutical Chemistry, Faculty of Pharmacy, İstanbul University, İstanbul, Turkey

2. Department of Pharmaceutical Chemistry, Institute of Health Sciences, İstanbul University, İstanbul, Turkey

3. Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey

4. Science and Research Informatics, F. Hoffmann-La Roche Ltd, Basel, Switzerland

5. Department of Chemical Engineering, Boğaziçi University, İstanbul, Turkey

Abstract

Machine learning models have found numerous successful applications in computational drug discovery. A large body of these models represents molecules as sequences since molecular sequences are easily available, simple, and informative. The sequence‐based models often segment molecular sequences into pieces called chemical words, analogous to the words that make up sentences in human languages, and then apply advanced natural language processing techniques for tasks such as de novo drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, chemical words, remain unexplored. To address this gap, we employ data‐driven SMILES tokenization techniques such as Byte Pair Encoding, WordPiece, and Unigram to identify chemical words and compare the resulting vocabularies. To understand the chemical significance of these words, we build a language‐inspired pipeline that treats high affinity ligands of protein targets as documents and selects key chemical words making up those ligands based on tf–idf weighting. The experiments on multiple protein–ligand affinity datasets show that despite differences in words, lengths, and validity among the vocabularies generated by different subword tokenization algorithms, the identified key chemical words exhibit similarity. Further, we conduct case studies on a number of targets to analyze the impact of key chemical words on binding. We find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our approach elucidates chemical properties of the words identified by machine learning models and can be used in drug discovery studies to determine significant chemical moieties.
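
The pipeline described in the abstract (segment SMILES into chemical words, treat each target's high affinity ligands as one document, rank words by tf–idf) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy Byte Pair Encoding learner starts from single characters rather than chemically aware atom tokens, the tf–idf variant (relative term frequency times log(N/df)) is an assumption, and the function names, target names, and SMILES strings are hypothetical examples chosen for illustration.

```python
import math
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with the merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(smiles_list, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single characters.

    Assumption: character-level start; a real setup would likely begin
    from atom-level SMILES tokens (e.g. keeping 'Cl' or '[nH]' intact).
    """
    corpus = [list(s) for s in smiles_list]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best, freq = pairs.most_common(1)[0]
        if freq < 2:  # no pair repeats, nothing useful left to merge
            break
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(smiles, merges):
    """Segment one SMILES string into chemical words using learned merges."""
    tokens = list(smiles)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


def key_words(target_to_ligands, merges, top_k=3):
    """Rank chemical words per target by tf-idf; one document per target."""
    docs = {
        target: Counter(w for s in ligands for w in tokenize(s, merges))
        for target, ligands in target_to_ligands.items()
    }
    n_docs = len(docs)
    # Document frequency: in how many target documents each word occurs.
    df = Counter(w for counts in docs.values() for w in counts)
    ranked = {}
    for target, counts in docs.items():
        total = sum(counts.values())
        scores = {w: (c / total) * math.log(n_docs / df[w]) for w, c in counts.items()}
        ranked[target] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return ranked


# Hypothetical toy data: target names and ligand sets are illustrative only.
toy = {
    "target_A": ["NS(=O)(=O)c1ccccc1", "CC(=O)Nc1ccc(cc1)S(N)(=O)=O"],
    "target_B": ["CC(=O)Oc1ccccc1C(=O)O", "OC(=O)c1ccccc1"],
}
all_smiles = [s for ligands in toy.values() for s in ligands]
merges = learn_bpe(all_smiles, num_merges=20)
for target, words in key_words(toy, merges).items():
    print(target, words)
```

With this weighting, a word that occurs in every target's document receives a score of zero, so only words that distinguish one target's ligands from the others can rank highly. That is the property the target-specific key chemical words rely on.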

Funder

Türkiye Bilimsel ve Teknolojik Araştırma Kurumu

Publisher

Wiley

Subject

Organic Chemistry, Computer Science Applications, Drug Discovery, Molecular Medicine, Structural Biology

