Author:
Chen Lieyang, Cruz Anthony, Ramsey Steven, Dickson Callum J., Duca José S., Hornak Viktor, Koes David R., Kurtzman Tom
Abstract
<p>Recently, much effort has been invested in using convolutional neural
network (CNN) models trained on 3D structural images of protein-ligand
complexes to distinguish binding from non-binding ligands for virtual screening.
However, the dearth of reliable protein-ligand X-ray structures and binding affinity
data has required the use of constructed datasets for the training and
evaluation of CNN molecular recognition models. Here, we outline various
sources of bias in one such widely used dataset, the Directory of Useful
Decoys: Enhanced (DUD-E). We have constructed and performed tests to
investigate whether CNN models developed using DUD-E are properly learning the
underlying physics of molecular recognition, as intended, or are instead
learning biases inherent in the dataset itself. We find that the superior
enrichment efficiency of CNN models can be attributed to the analogue and decoy
bias hidden in the DUD-E dataset rather than to successful generalization of the
pattern of protein-ligand interactions. Comparing additional deep learning
models trained on PDBbind datasets, we found that their enrichment performance
on DUD-E is not superior to that of the docking program AutoDock
Vina. Together, these results suggest that biases that could be present in
constructed datasets should be thoroughly evaluated before applying them to
machine-learning-based methodology development.</p>
Publisher
American Chemical Society (ACS)
Cited by
9 articles.