Author:
Guo Zhihui, Sharma Pramod Kumar, Du Liang, Abraham Robin
Abstract
Molecular representation learning plays an essential role in cheminformatics. Recently, language model-based approaches have become popular as an alternative to traditional expert-designed features for encoding molecules. However, these approaches utilize only a single modality for representing molecules. Driven by the fact that a given molecule can be described through different modalities such as the Simplified Molecular-Input Line-Entry System (SMILES), International Union of Pure and Applied Chemistry (IUPAC) names, and the IUPAC International Chemical Identifier (InChI), we propose a multimodal molecular embedding generation approach called MM-Deacon (multimodal molecular domain embedding analysis via contrastive learning). MM-Deacon is trained using SMILES and IUPAC molecule representations as two different modalities. First, SMILES and IUPAC strings are encoded independently by two different transformer-based language models. A contrastive loss is then used to bring the encoded representations from the two modalities closer to each other if they belong to the same molecule, and to push them farther apart if they belong to different molecules. We evaluate the robustness of our molecule embeddings on molecule clustering, cross-modal molecule search, drug similarity assessment, and drug-drug interaction tasks.
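The contrastive objective described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact form of the loss, so the symmetric InfoNCE-style formulation below, the `temperature` value of 0.07, and all function names are assumptions made for illustration. The embeddings are plain lists of floats standing in for the outputs of the two transformer encoders, with row i of each input assumed to describe the same molecule.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(smiles_emb, iupac_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumption: smiles_emb[i] and iupac_emb[i] encode the same molecule,
    and every other pairing in the batch is treated as a negative.
    """
    s = [l2_normalize(v) for v in smiles_emb]
    u = [l2_normalize(v) for v in iupac_emb]
    # Pairwise cosine similarities, scaled by the (assumed) temperature.
    logits = [[sum(a * b for a, b in zip(si, uj)) / temperature for uj in u]
              for si in s]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the SMILES-to-IUPAC and IUPAC-to-SMILES directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

Under these assumptions, a batch in which each SMILES embedding matches its own IUPAC embedding yields a loss close to zero, while a batch with the pairings swapped yields a much larger loss, which is the behavior the abstract describes.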
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.