OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization

Published: 2022-11-22

Author:

Ahdritz Gustaf, Bouatta Nazim, Kadyan Sachin, Xia Qinghui, Gerecke William, O’Donnell Timothy J, Berenberg Daniel, Fisk Ian, Zanichelli Niccolò, Zhang Bo, Nowaczynski Arkadiusz, Wang Bei, Stepniewska-Dziubinska Marta M, Zhang Shang, Ojewole Adegoke, Guney Murat Efe, Biderman Stella, Watkins Andrew M, Ra Stephen, Lorenzo Pablo Ribalta, Nivon Lucas, Weitzner Brian, Ban Yih-En Andrew, Sorger Peter K, Mostaque Emad, Zhang Zhao, Bonneau Richard, AlQuraishi Mohammed

Abstract

AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (i) tackle new tasks, like protein-ligand complex structure prediction, (ii) investigate the process by which the model learns, which remains poorly understood, and (iii) assess the model’s generalization capacity to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient, and trainable implementation of AlphaFold2, and OpenProteinSet, the largest public database of protein multiple sequence alignments. We use OpenProteinSet to train OpenFold from scratch, fully matching the accuracy of AlphaFold2. Having established parity, we assess OpenFold’s capacity to generalize across fold space by retraining it using carefully designed datasets. We find that OpenFold is remarkably robust at generalizing despite extreme reductions in training set size and diversity, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced by OpenFold during training, we also gain surprising insights into the manner in which the model learns to fold proteins, discovering that spatial dimensions are learned sequentially. Taken together, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial new resource for the protein modeling community.

Publisher

Cold Spring Harbor Laboratory

References: 80 articles.

1. Unified rational protein engineering with sequence-based deep representation learning

2. End-to-End Differentiable Learning of Protein Structure; Cell Systems; 2019

3. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures

4. Principles that Govern the Folding of Protein Chains

5. Baek, M. (2021). Twitter post: Adding a big enough number for “residue_index” feature is enough to model hetero-complex using AlphaFold (green&cyan: crystal structure / magenta: predicted model w/ residue_index modification). https://twitter.com/minkbaek/status/1417538291709071362.

Cited by 90 articles.

1. Reshaping the binding pocket of D-tagaturonate epimerase UxaE to improve the epimerization activity of C4-OH for enabling green synthesis of d-tagatose; Molecular Catalysis; 2024-09

2. AlphaFold predictions of fold-switched conformations are driven by structure memorization; Nature Communications; 2024-08-24

3. The Natural Future for AI in Biotech: The Next Generation of Machine Learning Demands Partnership with Biodiversity; GEN Biotechnology; 2024-08-01

4. Stability Oracle: a structure-based graph-transformer framework for identifying stabilizing mutations; Nature Communications; 2024-07-23

5. AlphaFold Model Quality Self-Assessment Improvement Via Deep Graph Learning; 2024-07-22