PINDER: The protein interaction dataset and evaluation resource

Published: 2024-07-19

Author:

Kovtun Daniel, Akdel Mehmet, Goncearenco Alexander, Zhou Guoqing, Holt Graham, Baugher David, Lin Dejun, Adeshina Yusuf, Castiglione Thomas, Wang Xiaoyun, Marquet Céline, McPartlon Matt, Geffner Tomas, Rossi Emanuele, Corso Gabriele, Stärk Hannes, Carpenter Zachary, Kucukbenli Emine, Bronstein Michael, Naef Luca

Abstract

Protein-protein interactions (PPIs) are fundamental to understanding biological processes and play a key role in therapeutic advancements. As deep-learning docking methods for PPIs gain traction, benchmarking protocols and datasets tailored for effective training and evaluation of their generalization capabilities and performance across real-world scenarios become imperative. Aiming to overcome limitations of existing approaches, we introduce PINDER, a comprehensive annotated dataset that uses structural clustering to derive non-redundant interface-based data splits and includes holo (bound), apo (unbound), and computationally predicted structures. PINDER consists of 2,319,564 dimeric PPI systems (and up to 25 million augmented PPIs) and 1,955 high-quality test PPIs with interface data leakage removed. Additionally, PINDER provides a test subset with 180 dimers for comparison to AlphaFold-Multimer without any interface leakage with respect to its training set. Unsurprisingly, the PINDER benchmark reveals that the performance of existing docking models is highly overestimated when evaluated on leaky test sets. Most importantly, by retraining DiffDock-PP on PINDER interface-clustered splits, we show that interface cluster-based sampling of the training split, along with the diverse and less leaky validation split, leads to strong generalization improvements.
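
The two ideas the abstract relies on, cluster-level splitting (so that no interface cluster appears in both training and test data) and cluster-based sampling of the training split, can be illustrated with a minimal Python sketch. This is not PINDER's actual pipeline or API: the system names, the cluster labels, and the helper functions `group_by_cluster`, `split_by_cluster`, and `sample_cluster_balanced` are all hypothetical, and the cluster assignment is taken as given rather than derived from structural clustering.

```python
import random
from collections import defaultdict

def group_by_cluster(cluster_of):
    """Map each cluster id to the sorted list of systems it contains."""
    clusters = defaultdict(list)
    for system_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(system_id)
    return {c: sorted(members) for c, members in clusters.items()}

def split_by_cluster(cluster_of, test_fraction, seed=0):
    """Assign whole clusters to train or test so no cluster spans both."""
    clusters = group_by_cluster(cluster_of)
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    n_test = max(1, round(len(cluster_ids) * test_fraction))
    test = [s for c in cluster_ids[:n_test] for s in clusters[c]]
    train = [s for c in cluster_ids[n_test:] for s in clusters[c]]
    return train, test

def sample_cluster_balanced(train_ids, cluster_of, n, seed=0):
    """Draw n systems: pick a cluster uniformly, then a member uniformly.

    Large clusters are therefore not over-represented in the sample.
    """
    rng = random.Random(seed)
    clusters = group_by_cluster({s: cluster_of[s] for s in train_ids})
    cluster_ids = sorted(clusters)
    return [rng.choice(clusters[rng.choice(cluster_ids)]) for _ in range(n)]

# Toy, made-up assignment of dimer systems to interface clusters.
cluster_of = {
    "sysA": "c1", "sysB": "c1", "sysC": "c1",
    "sysD": "c2", "sysE": "c3", "sysF": "c4",
}
train, test = split_by_cluster(cluster_of, test_fraction=0.25)
batch = sample_cluster_balanced(train, cluster_of, n=8)
```

Splitting at the cluster level rather than the system level is what keeps near-duplicate interfaces out of the test set in this sketch; sampling a cluster first and a member second is one simple way to keep a large cluster such as `c1` from dominating training batches.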

Publisher

Cold Spring Harbor Laboratory
