Knowledge-based BERT: a method to extract molecular features like computational chemists

Author:

Wu Zhenxing (1,2,3), Jiang Dejun (1), Wang Jike (1,4), Zhang Xujun (1), Du Hongyan (1), Pan Lurong (5), Hsieh Chang-Yu (6), Cao Dongsheng (7), Hou Tingjun (1,2,3)

Affiliation:

1. Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China

2. Cancer Center, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China

3. State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China

4. National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, Hubei, P. R. China

5. Global Health Drug Discovery Institute, Beijing 100192, P. R. China

6. Tencent Quantum Laboratory, Tencent, Shenzhen 518057, Guangdong, P. R. China

7. Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410004, Hunan, P. R. China

Abstract

Molecular property prediction models based on machine learning algorithms have become important tools for triaging unpromising lead molecules in the early stages of drug discovery. Compared with the mainstream descriptor- and graph-based methods for molecular property prediction, SMILES-based methods can extract molecular features directly from SMILES without human expert knowledge, but they require more powerful feature-extraction algorithms and more training data, which makes them less popular. Here, we show the great potential of pre-training for improving the prediction of important pharmaceutical properties. Using three pre-training tasks based on atom feature prediction, molecular feature prediction and contrastive learning, we developed K-BERT, a new pre-training method that can extract chemical information from SMILES like chemists. Results on 15 pharmaceutical datasets show that K-BERT outperforms well-established descriptor-based (XGBoost) and graph-based (Attentive FP and HRGCN+) models. In addition, we found that the contrastive learning pre-training task enables K-BERT to ‘understand’ SMILES beyond canonical SMILES. Moreover, the general fingerprints generated by K-BERT (K-BERT-FP) exhibit predictive power comparable to MACCS on the 15 pharmaceutical datasets and can also capture molecular size and chirality information that traditional binary fingerprints cannot capture. Our results illustrate the great potential of K-BERT in practical applications of molecular property prediction in drug discovery.
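To make the SMILES-related points in the abstract more concrete, the following is a minimal sketch of atom-level SMILES tokenization. It is an illustration only: the regular expression and the helper names `tokenize` and `is_atom` are assumptions made for this example and are not taken from the K-BERT paper or its code. It shows that one molecule (ethanol) can be written as different SMILES strings that yield different token sequences, which is the kind of variation a model has to cope with when the input is not restricted to canonical SMILES.

```python
import re
from typing import List

# Hypothetical atom-level SMILES tokenizer (an assumption for illustration,
# not the tokenizer used by K-BERT).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"           # bracket atoms such as [C@@H] or [NH4+]
    r"|Br|Cl"               # two-letter atoms, tried before single letters
    r"|[BCNOPSFI]"          # one-letter aliphatic atoms
    r"|[bcnops]"            # aromatic atoms
    r"|%\d{2}|\d"           # ring-closure labels
    r"|[()=#\-+\\/:.~*@$]"  # branches, bond symbols, and other marks
)


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens, rejecting unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


def is_atom(token: str) -> bool:
    """True for atom tokens (bracket atoms or element symbols)."""
    return token.startswith("[") or token[0].isalpha()


# Two valid SMILES strings for the same molecule (ethanol).
ethanol_a = tokenize("CCO")
ethanol_b = tokenize("OCC")

# A chiral centre is carried by a single bracket-atom token.
alanine = tokenize("C[C@@H](N)C(=O)O")
heavy_atoms = sum(is_atom(t) for t in alanine)
```

Here `ethanol_a` and `ethanol_b` contain the same tokens in a different order, and the `[C@@H]` token in `alanine` is where the chirality information sits in the string. Counting atom tokens is a crude proxy for molecular size.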

Funder

Natural Science Foundation of China

Publisher

Oxford University Press (OUP)

Subject

Molecular Biology, Information Systems

Cited by 20 articles.
