Abstract
ABSTRACTOver the past few years, deep learning tools for protein design have made significant advances in the field of bioengineering, opening up new opportunities for drug discovery, disease prevention or industrial biotechnology. However, despite the growing interest and excitement surrounding these tools, progress in the field is hindered by a lack of standardized datasets for benchmarking. Most models are trained on data from the Protein Data Bank (PDB), the largest repository of experimentally determined biological macromolecular structures. But filtering and processing this data involves many hyperparameter choices that are often not harmonized across the research community. Moreover, the task of splitting protein data into training and validation subsets with minimal data leakage is not trivial and often overlooked. Here we present ProteinFlow, a computational pipeline to pre-process protein sequence and structural data for deep learning applications. The pipeline is fully configurable and allows the extraction of all levels of protein organization (primary to quaternary), allowing end-users to cater the dataset for a multitude of downstream tasks, such as protein sequence design, protein folding modeling or protein-protein interaction prediction. In addition, we curate a feature-rich benchmarking dataset based on the latest annual release of the PDB and a selection of preprocessing parameters that are widely used across the research community. We showcase its utility by benchmarking a state-of-the-art (SOTA) deep learning model for protein sequence design. The open source code is packaged as a python library and can be accessed onhttps://github.com/adaptyvbio/ProteinFlow.
Publisher
Cold Spring Harbor Laboratory