Author:
Xiangru Tang, Andrew Tran, Jeffrey Tan, Mark B. Gerstein
Abstract
Motivation: The present paradigm of deep learning models for molecular representation relies mostly on 1D or 2D formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the smaller body of research that focuses on explicit 3D representation tends to overlook textual data within the biomedical domain.
Results: We present a unified pre-trained language model that concurrently captures biomedical text, 2D, and 3D molecular information. Our model, the three-modality molecular language model MolLM, consists of a text Transformer encoder and a molecular Transformer encoder that encodes both 2D and 3D molecular structures. For MolLM, we construct 168K molecule-text pairings for training. We employ contrastive learning as a supervisory signal for cross-modal information learning. MolLM demonstrates robust molecular representation capabilities in numerous downstream tasks, including cross-modal molecule-text matching, property prediction, captioning, and text-prompted editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance on downstream tasks.
Availability and implementation: Our code, data, and pre-trained model weights are all available at https://github.com/gersteinlab/MolLM.
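The abstract states that contrastive learning supplies the cross-modal supervisory signal but does not give the exact objective. A common choice for aligning paired text and molecule embeddings is a symmetric InfoNCE (CLIP-style) loss; the sketch below, in NumPy, is an illustrative assumption, not MolLM's actual implementation (the function name and temperature value are hypothetical):

```python
import numpy as np

def info_nce(text_emb, mol_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of text_emb and row i of mol_emb are assumed to be a matched
    text-molecule pair; all other rows in the batch act as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    logits = t @ m.T / temperature      # pairwise similarity matrix
    labels = np.arange(len(t))          # matched pair sits on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(lg) / np.exp(lg).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # Average the text-to-molecule and molecule-to-text directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Under this objective, matched pairs are pulled together and mismatched pairs in the batch are pushed apart, so the loss for correctly paired embeddings should be lower than for randomly paired ones.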
Publisher
Cold Spring Harbor Laboratory