MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations

Published:2024-06-28 Issue:Supplement_1 Volume:40 Page:i357-i368
ISSN:1367-4803
Container-title:Bioinformatics
language:en
Author:

Tang Xiangru¹, Tran Andrew¹, Tan Jeffrey¹, Gerstein Mark B¹²³⁴⁵

Affiliation:

1. Department of Computer Science, Yale University, New Haven, CT 06520, United States

2. Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States

3. Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, United States

4. Department of Statistics & Data Science, Yale University, New Haven, CT 06520, United States

5. Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, United States

Abstract

Motivation: The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models’ versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain.

Results: We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder, designed to encode both 2D and 3D molecular structures. To support MolLM’s self-supervised pre-training, we constructed 160K molecule-text pairings. Employing contrastive learning as a supervisory signal for learning, MolLM demonstrates robust molecular representation capabilities across four downstream tasks, including cross-modal molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance in these downstream tasks.

Availability and implementation: Our code, data, pre-trained model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings for both molecules and text.
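The abstract names contrastive learning between a molecular encoder and a text encoder as the pre-training signal. As a rough illustration only, the sketch below shows one common form of such an objective, a symmetric InfoNCE-style loss over paired embeddings. It is not taken from the MolLM code base: the function names, the temperature value of 0.1, and the toy 2-dimensional embeddings are all assumptions made for this example, and the actual loss used by the authors may differ.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity_matrix(mol_embs, text_embs):
    """Entry [i][j] is the cosine similarity of molecule i and text j."""
    mol = [l2_normalize(v) for v in mol_embs]
    txt = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(m, t)) for t in txt] for m in mol]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(mol_embs, text_embs, temperature=0.1):
    """Average of molecule-to-text and text-to-molecule losses.

    The temperature is a hypothetical default chosen for this sketch.
    """
    sims = cosine_similarity_matrix(mol_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits_transposed))

# Toy embeddings (assumed values): pair i of molecules matches pair i of texts.
mol_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]
shuffled_text = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(mol_embs, aligned_text)
loss_shuffled = symmetric_contrastive_loss(mol_embs, shuffled_text)
print(loss_aligned < loss_shuffled)
```

In this toy setup the correctly paired embeddings yield a lower loss than the shuffled ones, which is the property a contrastive objective of this kind rewards. In practice the embeddings would come from the two encoders rather than being written by hand.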

Funder

National Institutes of Health

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/bioinformatics/article-pdf/40/Supplement_1/i357/58585840/btae260.pdf

Reference: 59 articles.

1. Quantifying the chemical beauty of drugs; Bickerton; Nat Chem, 2012

2. A survey and systematic assessment of computational methods for drug response prediction; Chen; Brief Bioinform, 2021

3. Convolutional embedding of attributed molecular graphs for physical property prediction; Coley; J Chem Inf Model, 2017