The influence of negative training set size on machine learning-based virtual screening

Published:2014-06-11 Issue:1 Volume:6
ISSN:1758-2946
Container-title:Journal of Cheminformatics
language:en
Short-container-title:J Cheminform

Author:

Kurczab Rafał,Smusz Sabina,Bojarski Andrzej J

Abstract

Background: The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods.

Results: The impact of this rather neglected aspect of applying machine learning methods was examined for sets containing a fixed number of positive examples and a varying number of negative examples randomly selected from the ZINC database. Increasing the number of negative training instances relative to positive ones was found to greatly influence most of the investigated evaluation parameters of ML methods in simulated virtual screening experiments. In a majority of cases, substantial increases in precision and MCC were observed together with some decreases in hit recall. Analysis of the dynamics of those variations allowed the authors to recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naïve Bayes, IBk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with the SMO or Random Forest algorithms. The Naïve Bayes models were barely sensitive to changes in the number of negative instances in the training set.

Conclusions: The ratio of positive to negative training instances should be taken into account when preparing machine learning experiments, as it may significantly influence the performance of a particular classifier. Moreover, optimizing the negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
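The setup described in the abstract (a fixed set of positives, a varying number of randomly drawn negatives, evaluation by precision, recall and MCC) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names `build_training_set` and `screening_metrics`, the `negatives_per_positive` parameter, and the confusion-matrix counts at the end are all hypothetical and chosen only for demonstration.

```python
import math
import random


def build_training_set(positives, negative_pool, negatives_per_positive, seed=0):
    """Pair a fixed positive set with randomly sampled negatives.

    `negatives_per_positive` is a hypothetical knob for the negative set size;
    the sample is capped at the size of the available pool.
    """
    rng = random.Random(seed)
    n_neg = min(len(negative_pool), len(positives) * negatives_per_positive)
    negatives = rng.sample(negative_pool, n_neg)
    # Label 1 = active (positive), label 0 = assumed inactive (negative)
    return [(x, 1) for x in positives] + [(x, 0) for x in negatives]


def screening_metrics(tp, fp, tn, fn):
    """Return (precision, recall, MCC) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


# Counts invented purely for illustration; they are not results from the paper.
p, r, m = screening_metrics(tp=80, fp=20, tn=880, fn=20)
print(f"precision={p:.3f} recall={r:.3f} MCC={m:.3f}")
```

Calling `build_training_set` with several values of `negatives_per_positive`, training a classifier on each set, and passing the resulting test-set counts to `screening_metrics` would give one way to trace how the three measures move as the negative set grows.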

Publisher

Springer Science and Business Media LLC

Subject

Library and Information Sciences,Computer Graphics and Computer-Aided Design,Physical and Theoretical Chemistry,Computer Science Applications

Link

http://link.springer.com/article/10.1186/1758-2946-6-32/fulltext.html

References: 26 articles.

1. Melville JL, Burke EK, Hirst JD: Machine learning in virtual screening. Comb Chem High Throughput Screen. 2009, 12: 332-343. 10.2174/138620709788167980.

2. Ma XH, Wang R, Yang SY, Li ZR, Xue Y, Wei YC, Low BC, Chen YZ: Evaluation of virtual screening performance of support vector machines trained by sparsely distributed active compounds. J Chem Inf Model. 2008, 48: 1227-1237. 10.1021/ci800022e.

3. Plewczynski D, Spieser SH, Koch U: Assessing different classification methods for virtual screening. J Chem Inf Model. 2006, 46: 1098-1106. 10.1021/ci050519k.

4. Bruce CL, Melville JL, Pickett SD, Hirst JD: Contemporary QSAR classifiers compared. J Chem Inf Model. 2007, 47: 219-227. 10.1021/ci600332j.

5. Smusz S, Kurczab R, Bojarski AJ: A multidimensional analysis of machine learning methods performance in the classification of bioactive compounds. Chemom Intell Lab Syst. 2013, 128: 89-100.

Cited by 62 articles.

1. Deep learning for low-data drug discovery: Hurdles and opportunities;Current Opinion in Structural Biology;2024-06

2. HealthPathFinder: Navigating the Healthcare Knowledge Graph with Neural Attention for Personalized Health Recommendations;Lecture Notes in Networks and Systems;2024

3. Machine learning for small molecule drug discovery in academia and industry;Artificial Intelligence in the Life Sciences;2023-12

4. Recent advances in deep learning for retrosynthesis;WIREs Computational Molecular Science;2023-10-20

5. Smart systems in producing algae-based protein to improve functional food ingredients industries;Food Research International;2023-03