Abstract
Protein-protein interactions (PPIs) are fundamental to understanding biological processes and play a key role in therapeutic advancements. As deep-learning docking methods for PPIs gain traction, benchmarking protocols and datasets tailored for effective training and evaluation of their generalization capabilities and performance across real-world scenarios become imperative. Aiming to overcome limitations of existing approaches, we introduce PINDER, a comprehensive annotated dataset that uses structural clustering to derive non-redundant interface-based data splits and includes holo (bound), apo (unbound), and computationally predicted structures. PINDER consists of 2,319,564 dimeric PPI systems (and up to 25 million augmented PPIs) and 1,955 high-quality test PPIs with interface data leakage removed. Additionally, PINDER provides a test subset with 180 dimers for comparison to AlphaFold-Multimer without any interface leakage with respect to its training set. Unsurprisingly, the PINDER benchmark reveals that the performance of existing docking models is highly overestimated when evaluated on leaky test sets. Most importantly, by retraining DiffDock-PP on PINDER interface-clustered splits, we show that interface cluster-based sampling of the training split, along with the diverse and less leaky validation split, leads to strong generalization improvements.
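As an illustration of what interface cluster-based sampling of a training split can look like, the following is a minimal sketch and not the PINDER API or the authors' exact procedure. It assumes each training system carries an interface cluster label; the function name `sample_by_cluster`, the `(system_id, cluster_id)` pair layout, and the toy identifiers are all hypothetical. Each draw first picks a cluster uniformly and then a member of that cluster uniformly, so heavily populated clusters do not dominate an epoch.

```python
import random
from collections import defaultdict


def sample_by_cluster(systems, n_samples, seed=0):
    """Draw n_samples system ids: choose a cluster uniformly, then a member uniformly.

    `systems` is an iterable of (system_id, cluster_id) pairs. This layout is an
    assumption made for the sketch, not a format defined by PINDER.
    """
    rng = random.Random(seed)

    # Group system ids by their interface cluster label.
    by_cluster = defaultdict(list)
    for system_id, cluster_id in systems:
        by_cluster[cluster_id].append(system_id)

    # Sort cluster ids so the result is reproducible for a given seed.
    cluster_ids = sorted(by_cluster)

    picks = []
    for _ in range(n_samples):
        cluster = rng.choice(cluster_ids)                # every cluster equally likely
        picks.append(rng.choice(by_cluster[cluster]))    # uniform within the cluster
    return picks


# Hypothetical toy data: one large cluster and two singleton clusters.
toy_systems = [(f"big_{i}", "cluster_A") for i in range(98)]
toy_systems += [("rare_1", "cluster_B"), ("rare_2", "cluster_C")]

epoch = sample_by_cluster(toy_systems, n_samples=3000)
share_big = sum(s.startswith("big_") for s in epoch) / len(epoch)
print(f"share of draws from the large cluster: {share_big:.2f}")
```

Under uniform sampling over systems, the large cluster would supply about 98% of the draws in this toy set; with cluster-first sampling each of the three clusters is expected to supply roughly one third.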
Publisher
Cold Spring Harbor Laboratory