Small molecule machine learning: All models are wrong, some may not even be useful

Published: 2023-03-27

Author:

Kretschmer Fleming, Seipp Jan, Ludwig Marcus, Klau Gunnar W., Böcker Sebastian

Abstract

A central assumption of all machine learning is that the training data are an informative subset of the true distribution we want to learn. Yet, this assumption may be violated in practice. Recently, learning from the molecular structures of small molecules has moved into the focus of the machine learning community. Usually, those small molecules are of biological interest, such as metabolites or drugs. Applications include prediction of toxicity, ligand binding or retention time.

We investigate how well certain large-scale datasets cover the space of all known biomolecular structures. Investigation of coverage requires a sensible distance measure between molecular structures. We use a well-known distance measure based on solving the Maximum Common Edge Subgraph (MCES) problem, which agrees well with the chemical and biochemical intuition of similarity between compounds. Unfortunately, this computational problem is NP-hard, severely restricting the use of the corresponding distance measure in large-scale studies. We introduce an exact approach that combines Integer Linear Programming and intricate heuristic bounds to ensure efficient computations and dependable results.

We find that several large-scale datasets frequently used in this domain of machine learning are far from a uniform coverage of known biomolecular structures. This severely confines the predictive power of models trained on this data. Next, we propose two further approaches to check if a training dataset differs substantially from the distribution of known biomolecular structures. On the positive side, our methods may allow creators of large-scale datasets to identify regions in molecular structure space where it is advisable to provide additional training data.

Publisher

Cold Spring Harbor Laboratory

References: 83 articles.

1. Best practices in machine learning for chemistry;Nat Chem,2021

2. DOME: recommendations for supervised machine learning validation in biology

3. Interpretation of the DOME Recommendations for Machine Learning in Proteomics and Metabolomics;J Proteome Res,2022

4. A guide to machine learning for biologists

5. MoleculeNet: a benchmark for molecular machine learning

Cited by 2 articles.

1. Performance and robustness of small molecule retention time prediction with molecular graph neural networks in industrial drug discovery campaigns;Scientific Reports;2024-04-16

2. RepoRT: a comprehensive repository for small molecule retention times;Nature Methods;2024-01-08