The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications

Published:2024-06-12 Issue:1 Volume:7
ISSN:2399-3669
Container-title:Communications Chemistry
language:en
Short-container-title:Commun Chem

Author:

Snyder Scott H., Vignaux Patricia A., Ozalp Mustafa Kemal, Gerlach Jacob, Puhl Ana C., Lane Thomas R., Corbett John, Urbina Fabio, Ekins Sean

Abstract

Recent advances in machine learning (ML) have led to newer model architectures, including transformers (large language models, LLMs), which show state-of-the-art results in text generation and image analysis, as well as few-shot learning (FSLC) models, which offer predictive power with extremely small datasets. These new architectures may offer promise, yet the ‘no free lunch’ theorem suggests that no single model algorithm can outperform at all possible tasks. Here, we explore the capabilities of classical (SVR), FSLC, and transformer (MolBART) models over a range of dataset tasks and show a ‘goldilocks zone’ for each model type, in which dataset size and feature distribution (i.e. dataset “diversity”) determine the optimal algorithm strategy. When datasets are small (< 50 molecules), FSLC models tend to outperform both classical ML and transformers. When datasets are small-to-medium sized (50-240 molecules) and diverse, transformers outperform both classical models and few-shot learning. Finally, when datasets are larger and of sufficient size, classical models perform the best, suggesting that the optimal model to choose likely depends on the dataset available, its size and diversity. These findings may help to answer the perennial question of which ML algorithm to use when faced with a new dataset.
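The size and diversity regimes described in the abstract can be read as a simple selection heuristic. The sketch below is only an illustration of that reading, not the authors' code: the function name `suggest_model_family`, the boolean `is_diverse` flag, and the treatment of anything above 240 molecules as "larger" are assumptions, and the abstract does not say which model is preferred for a 50-240 molecule dataset that is not diverse, so that case returns `None`.

```python
from typing import Optional

# Thresholds taken from the abstract; the function and its inputs are hypothetical.
SMALL_MAX = 50    # "small (< 50 molecules)"
MEDIUM_MAX = 240  # "small-to-medium sized (50-240 molecules)"


def suggest_model_family(n_molecules: int, is_diverse: bool) -> Optional[str]:
    """Return the model family the abstract reports as strongest, if stated.

    `is_diverse` is an assumed yes/no summary of dataset diversity; the
    abstract does not define how diversity is measured.
    """
    if n_molecules < 0:
        raise ValueError("n_molecules must be non-negative")
    if n_molecules < SMALL_MAX:
        return "few-shot learning (FSLC)"
    if n_molecules <= MEDIUM_MAX:
        # The abstract only covers the diverse case in this size range.
        return "transformer (MolBART)" if is_diverse else None
    # Assumption: "larger and of sufficient size" is read as above 240 molecules.
    return "classical ML (SVR)"


if __name__ == "__main__":
    for n, diverse in [(30, True), (120, True), (120, False), (1000, False)]:
        print(n, diverse, suggest_model_family(n, diverse))
```

A `None` result marks a regime the abstract leaves open; in practice a benchmark across the candidate models would be needed there.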

Funder

U.S. Department of Health & Human Services | NIH | National Institute of General Medical Sciences

U.S. Department of Health & Human Services | NIH | National Institute of Environmental Health Sciences

Publisher

Springer Science and Business Media LLC

Link

https://www.nature.com/articles/s42004-024-01220-4.pdf

References (91 articles)

1. Ekins, S. et al. Exploiting machine learning for end-to-end drug discovery and development. Nat. Mater. 18, 435–441 (2019).

2. Ekins, S., Lane, T. R., Urbina, F. & Puhl, A. C. In silico ADME/tox comes of age: twenty years later. Xenobiotica 1–7, https://doi.org/10.1080/00498254.2023.2245049 (2023).

3. Cheng, F., Li, W., Liu, G. & Tang, Y. In silico ADMET prediction: recent advances, current challenges and future trends. Curr. Top. Med. Chem. 13, 1273–1289 (2013).

4. Zhavoronkov, A. et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 37, 1038–1040 (2019).

5. Ekins, S., Mestres, J. & Testa, B. In silico pharmacology for drug discovery: applications to targets and beyond. Br. J. Pharm. 152, 21–37 (2007).

Cited by 1 article.

1. Predicting the Hallucinogenic Potential of Molecules Using Artificial Intelligence; ACS Chemical Neuroscience; 2024-08-02