Abstract
Protein-ligand interactions (PLI) are foundational to small molecule drug design. With computational methods striving towards experimental accuracy, there is a critical demand for a well-curated and diverse PLI dataset. Existing datasets are often limited in size and diversity, and commonly used evaluation sets suffer from training information leakage, hindering the realistic assessment of method generalization capabilities. To address these shortcomings, we present PLINDER, the largest and most annotated dataset to date, comprising 449,383 PLI systems, each with over 500 annotations, similarity metrics at protein, pocket, interaction, and ligand levels, and paired unbound (apo) and predicted structures. We propose an approach to generate training and evaluation splits that minimizes task-specific leakage and maximizes test set quality, and compare the resulting performance of DiffDock when retrained with different kinds of splits.
Publisher
Cold Spring Harbor Laboratory