OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization

Authors:

Ahdritz Gustaf, Bouatta Nazim, Kadyan Sachin, Xia Qinghui, Gerecke William, O’Donnell Timothy J, Berenberg Daniel, Fisk Ian, Zanichelli Niccolò, Zhang Bo, Nowaczynski Arkadiusz, Wang Bei, Stepniewska-Dziubinska Marta M, Zhang Shang, Ojewole Adegoke, Guney Murat Efe, Biderman Stella, Watkins Andrew M, Ra Stephen, Lorenzo Pablo Ribalta, Nivon Lucas, Weitzner Brian, Ban Yih-En Andrew, Sorger Peter K, Mostaque Emad, Zhang Zhao, Bonneau Richard, AlQuraishi Mohammed

Abstract

AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (i) tackle new tasks, like protein-ligand complex structure prediction, (ii) investigate the process by which the model learns, which remains poorly understood, and (iii) assess the model’s generalization capacity to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient, and trainable implementation of AlphaFold2, and OpenProteinSet, the largest public database of protein multiple sequence alignments. We use OpenProteinSet to train OpenFold from scratch, fully matching the accuracy of AlphaFold2. Having established parity, we assess OpenFold’s capacity to generalize across fold space by retraining it using carefully designed datasets. We find that OpenFold is remarkably robust at generalizing despite extreme reductions in training set size and diversity, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced by OpenFold during training, we also gain surprising insights into the manner in which the model learns to fold proteins, discovering that spatial dimensions are learned sequentially. Taken together, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial new resource for the protein modeling community.

Publisher

Cold Spring Harbor Laboratory

