A Step Towards Generalisability: Training a Machine Learning Scoring Function for Structure-Based Virtual Screening

Published: 2022-10-31

Author:

Scantlebury Jack, Vost Lucy, Carbery Anna, Hadfield Thomas E., Turnbull Oliver M., Brown Nathan, Chenthamarakshan Vijil, Das Payel, Grosjean Harold, von Delft Frank, Deane Charlotte M.

Abstract

Over the last few years, many machine learning-based scoring functions for predicting the binding of small molecules to proteins have been developed. Their objective is to approximate the distribution which takes two molecules as input and outputs the energy of their interaction. Only a scoring function that accounts for the interatomic interactions involved in binding can accurately predict binding affinity on unseen molecules. However, many scoring functions make predictions based on dataset biases rather than an understanding of the physics of binding. These scoring functions perform well when tested on targets similar to those in the training set, but fail to generalise to dissimilar targets. To test what a machine learning-based scoring function has learnt, input attribution (a technique for learning which features are important to a model when making a prediction on a particular data point) can be applied. If a model successfully learns something beyond dataset biases, attribution should give insight into the important binding interactions that are taking place. We built a machine learning-based scoring function that aimed to avoid the influence of bias via thorough train and test dataset filtering, and show that it achieves performance on the CASF-2016 benchmark comparable to that of other leading methods. We then use the CASF-2016 test set to perform attribution, and find that the bonds identified as important by our scoring function, PointVS, unlike those extracted from other scoring functions, have a high correlation with those found by a distance-based interaction profiler. We then show that attribution can be used to extract important binding pharmacophores from a given protein target when supplied with a number of bound structures. We use this information to perform fragment elaboration, and see improvements in docking scores compared to using structural information from a traditional, data-based approach. This not only provides definitive proof that the scoring function has learnt to identify some important binding interactions, but also constitutes the first deep learning-based method for extracting structural information from a target for molecule design.

Publisher

Cold Spring Harbor Laboratory

References: 55 articles.

1. E. Barnett, D. Onete, A. Salekin, and S. V. Faraone, “Genomic machine learning meta-regression: Insights on associations of study features with reported model performance,” medRxiv, 2022.

2. Inflated prediction accuracy of neuropsychiatric biomarkers caused by data leakage in feature selection; Scientific Reports, 2021

3. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans; Nature Machine Intelligence, 2021

4. F. Tu, J. Zhu, Q. Zheng, and M. Zhou, “Be careful of when: An empirical study on time-related misuse of issue tracking data,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2018, (New York, NY, USA), pp. 307–318, Association for Computing Machinery, 2018.

5. In Need of Bias Control: Evaluating Chemical Data for Machine Learning in Structure-Based Virtual Screening

Cited by 2 articles.

1. HydraScreen: A Generalizable Structure-Based Deep Learning Approach to Drug Discovery; Journal of Chemical Information and Modeling; 2024-07-22

2. Robustly interrogating machine learning-based scoring functions: what are they learning?; 2023-11-02