Practical guidelines for the use of gradient boosting for molecular property prediction-Reference-Cited by-同舟云学术

Practical guidelines for the use of gradient boosting for molecular property prediction

Published:2023-08-28 Issue:1 Volume:15 Page:
ISSN:1758-2946
Container-title:Journal of Cheminformatics
language:en
Short-container-title:J Cheminform

Author:

Boldini Davide,Grisoni Francesca,Kuhn Daniel,Friedrich Lukas,Sieber Stephan A.

Abstract

AbstractDecision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure–activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention, for its performance in data science competitions, virtual screening campaigns, and bioactivity prediction. However, different variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints, comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in regularization techniques and decision tree structures. Thus, expert knowledge must always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets and that it is crucial to optimize as many hyperparameters as possible to maximize the predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications. Graphical abstract

Funder

Technische Universität München

Publisher

Springer Science and Business Media LLC

Subject

Library and Information Sciences,Computer Graphics and Computer-Aided Design,Physical and Theoretical Chemistry,Computer Science Applications

Link

https://link.springer.com/content/pdf/10.1186/s13321-023-00743-7.pdf

Reference57 articles.

1. Keshavarzi Arshadi A, Salem M, Firouzbakht A, Yuan JS (2022) MolData, a molecular benchmark for disease and target based machine learning. J Cheminf 14(1):10. https://doi.org/10.1186/s13321-022-00590-y

2. Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M, Palmer A, Settels V, Jaakkola T, Jensen K, Barzilay R (2019) Analyzing learned molecular representations for property prediction. J Chem Inf Model 59(8):3370–3388. https://doi.org/10.1021/acs.jcim.9b00237

3. Aleksić S, Seeliger D, Brown JB (2021) ADMET Predictability at Boehringer Ingelheim: state-of-the-art, and do bigger datasets or algorithms make a difference? Mol Inform. https://doi.org/10.1002/minf.202100113