Can large language models understand molecules?

Published:2024-06-26 Issue:1 Volume:25 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Sadeghi Shaghayegh, Bui Alan, Forooghi Ali, Lu Jianguo, Ngom Alioune

Abstract

Purpose: Large Language Models (LLMs) like Generative Pre-trained Transformer (GPT) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in the field of cheminformatics, particularly in understanding Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs also have the ability to decode SMILES strings into vector representations.

Method: We investigate the performance of GPT and LLaMA compared to pre-trained models on SMILES in embedding SMILES strings on downstream tasks, focusing on two key applications: molecular property prediction and drug-drug interaction (DDI) prediction.

Results: We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings show results comparable to pre-trained models on SMILES in molecular prediction tasks and outperform the pre-trained models for the DDI prediction tasks.

Conclusion: The performance of LLMs in generating SMILES embeddings shows great potential for further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding, motivating additional research into the potential of LLMs in the molecular representation field.

GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT

Funder

Natural Sciences and Engineering Research Council of Canada

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s12859-024-05847-x.pdf

References (45 articles)

1. Li P, Wang J, Qiao Y, Chen H, Yu Y, Yao X, et al. An effective self-supervised framework for learning expressive molecular global representations to drug discovery. Brief Bioinform. 2021;22(6):bbab109.

2. Lv Q, Chen G, Zhao L, Zhong W, Yu-Chian CC. Mol2Context-vec: learning molecular representation from context awareness for drug discovery. Brief Bioinform. 2021;22(6):bbab317.

3. Liu Y, Zhang R, Li T, Jiang J, Ma J, Wang P. MolRoPE-BERT: An enhanced molecular representation with Rotary Position Embedding for molecular property prediction. J Mol Graph Model. 2023;118:108344.

4. Ross J, Belgodere B, Chenthamarakshan V, Padhi I, Mroueh Y, Das P. Large-scale chemical language representations capture molecular structure and properties. Nat Mach Intell. 2022;4(12):1256–64.

5. Zhang XC, Wu CK, Yang ZJ, Wu ZX, Yi JC, Hsieh CY, et al. MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. Brief Bioinform. 2021;22(6):bbab152.