Machine learning classification can reduce false positives in structure-based virtual screening

Published: 2020-01-11

Author:

Adeshina Yusuf, Deeds Eric, Karanicolas John

Abstract

With the recent explosion in the size of libraries available for screening, virtual screening is positioned to assume a more prominent role in early drug discovery’s search for active chemical matter. Modern virtual screening methods are still, however, plagued by high false positive rates: typically, only about 12% of the top-scoring compounds actually show activity when tested in biochemical assays. We argue that most scoring functions used for this task have been developed with insufficient attention to the datasets on which they are trained and tested, leading to overly simplistic models and/or overtraining. These problems are compounded in the literature because none of the studies reporting new scoring methods have validated their model prospectively within the same study. Here, we report a new strategy for building a training dataset (D-COID) that aims to generate highly compelling decoy complexes that are individually matched to available active complexes. Using this dataset, we train a general-purpose classifier for virtual screening (vScreenML) that is built on the XGBoost framework of gradient-boosted decision trees. In retrospective benchmarks, our new classifier shows outstanding performance relative to other scoring functions. We additionally evaluate the classifier in a prospective context, by screening for new acetylcholinesterase inhibitors. Remarkably, we find that nearly all compounds selected by vScreenML show detectable activity at 50 µM, with 10 of 23 providing greater than 50% inhibition at this concentration. Without any medicinal chemistry optimization, the most potent hit from this initial screen has an IC50 of 280 nM, corresponding to a Ki value of 173 nM. These results support using the D-COID strategy for training classifiers in other computational biology tasks, and using vScreenML in virtual screening campaigns against other protein targets. Both D-COID and vScreenML are freely distributed to facilitate such efforts.
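The two numerical claims at the end of the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the abstract does not say how the Ki value was derived from the IC50, so the Cheng-Prusoff relation for a competitive inhibitor, Ki = IC50 / (1 + [S]/Km), is an assumption here, and the function names are hypothetical. Under that assumption, the reported pair of values implies a particular substrate-to-Km ratio in the assay.

```python
# Minimal sketch. Assumes competitive inhibition and the Cheng-Prusoff
# relation; the abstract itself does not state the conversion that was used.

def cheng_prusoff_ki(ic50_nm: float, substrate_over_km: float) -> float:
    """Ki (nM) from IC50 (nM) and the ratio [S]/Km: Ki = IC50 / (1 + [S]/Km)."""
    return ic50_nm / (1.0 + substrate_over_km)


def implied_substrate_over_km(ic50_nm: float, ki_nm: float) -> float:
    """Invert Cheng-Prusoff: the [S]/Km ratio consistent with an IC50/Ki pair."""
    return ic50_nm / ki_nm - 1.0


# Values quoted in the abstract.
IC50_NM = 280.0
KI_NM = 173.0

ratio = implied_substrate_over_km(IC50_NM, KI_NM)
ki_check = cheng_prusoff_ki(IC50_NM, ratio)

# 10 of the 23 tested compounds gave more than 50% inhibition at 50 µM.
hit_fraction = 10 / 23

print(f"implied [S]/Km: {ratio:.3f}")
print(f"Ki recovered from that ratio: {ki_check:.1f} nM")
print(f"fraction with >50% inhibition: {hit_fraction:.1%}")
```

For comparison, the abstract cites a typical hit rate of about 12% among top-scoring compounds, while 10 of 23 corresponds to roughly 43%. The two figures use different activity criteria, so the comparison is only indicative.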

Publisher

Cold Spring Harbor Laboratory

References: 109 articles.

1. Cancer Genome Landscapes

2. Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity

3. Target validation using chemical probes

4. Impact of high-throughput screening in biomedical research

5. Clare RH, Bardelle C, Harper P, Hong WD, Borjesson U, Johnston KL, Collier M, Myhill L, Cassidy A, Plant D, Plant H, Clark R, Cook DAN, Steven A, Archer J, McGillan P, Charoensutthivarakul S, Bibby J, Sharma R, Nixon GL, Slatko BE, Cantin L, Wu B, Turner J, Ford L, Rich K, Wigglesworth M, Berry NG, O’Neill PM, Taylor MJ, Ward SA. Industrial scale high-throughput screening delivers multiple fast acting macrofilaricides. Nat Commun. 2019; 10:11.

Cited by 3 articles.

1. Structural Bioinformatics and Artificial Intelligence Approaches in De Novo Drug Design; Marvels of Artificial and Computational Intelligence in Life Sciences; 2023-09-18

2. Machine‐learning scoring functions for structure‐based virtual screening; WIREs Computational Molecular Science; 2020-04-22

3. The impact of compound library size on the performance of scoring functions for structure-based virtual screening; 2020-03-20