Affiliation:
1. College of Food and Bioengineering, Henan University of Science and Technology, Luoyang, P. R. China
2. Henan Engineering Research Center of Food Microbiology, Luoyang 471023, P. R. China
Abstract
A front-end method based on random forest proximity distance (PD) is used to screen the test set to improve protein–protein interaction site (PPIS) prediction. The assessment of a distance metric is done under the assumption that a distance definition of higher quality leads to higher classification. On an independent test set, the numerical analysis based on statistical inference shows that the PD has the advantage over Mahalanobis and Cosine distance. Based on the fact that the proximity distance depends on the tree composition of the random forest model, an iterative method is designed to optimize the proximity distance, which adjusts the tree composition of the random forest model by adjusting the size of the training set. Two PD metrics, 75PD and 50PD, are obtained by the iterative method. On two independent test sets, compared with the PD produced by the original training set, the values of 75PD in Matthews correlation coefficient and F1 score were higher, and the differences between them were statistically significant. All numerical experiments show that the closer the distance between the test data and the training data, the better the prediction results of the predictor. These indicate that the iterative method can optimize proximity distance definition and the distance information provided by PD can be used to indicate the reliability of prediction results.
Funder
National Natural Science Foundation of China
Publisher
World Scientific Pub Co Pte Lt
Subject
Computer Science Applications,Molecular Biology,Biochemistry
Cited by
4 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献