ProteinFlow: a Python Library to Pre-Process Protein Structure Data for Deep Learning Applications

Published: 2023-09-26

Author:

Kozlova Elizaveta, Valentin Arthur, Khadhraoui Aous, Nakhaee-Zadeh Gutierrez Daniel

Abstract

Over the past few years, deep learning tools for protein design have made significant advances in the field of bioengineering, opening up new opportunities for drug discovery, disease prevention, or industrial biotechnology. However, despite the growing interest and excitement surrounding these tools, progress in the field is hindered by a lack of standardized datasets for benchmarking. Most models are trained on data from the Protein Data Bank (PDB), the largest repository of experimentally determined biological macromolecular structures. But filtering and processing this data involves many hyperparameter choices that are often not harmonized across the research community. Moreover, the task of splitting protein data into training and validation subsets with minimal data leakage is not trivial and is often overlooked. Here we present ProteinFlow, a computational pipeline to pre-process protein sequence and structural data for deep learning applications. The pipeline is fully configurable and supports the extraction of all levels of protein organization (primary to quaternary), allowing end-users to tailor the dataset to a multitude of downstream tasks, such as protein sequence design, protein folding modeling, or protein-protein interaction prediction. In addition, we curate a feature-rich benchmarking dataset based on the latest annual release of the PDB and a selection of preprocessing parameters that are widely used across the research community. We showcase its utility by benchmarking a state-of-the-art (SOTA) deep learning model for protein sequence design. The open source code is packaged as a Python library and can be accessed at https://github.com/adaptyvbio/ProteinFlow.

Publisher

Cold Spring Harbor Laboratory
