Abstract
AbstractWe consider the problem of predicting a protein sequence from its backbone atom coordinates. Machine learning approaches to this problem to date have been limited by the number of available experimentally determined protein structures. We augment training data by nearly three orders of magnitude by predicting structures for 12M protein sequences using AlphaFold2. Trained with this additional data, a sequence-to-sequence transformer with invariant geometric input processing layers achieves 51% native sequence recovery on structurally held-out backbones with 72% recovery for buried residues, an overall improvement of almost 10 percentage points over existing methods. The model generalizes to a variety of more complex tasks including design of protein complexes, partially masked structures, binding interfaces, and multiple states.
Publisher
Cold Spring Harbor Laboratory
Reference82 articles.
1. The rosetta all-atom energy function for macromolecular modeling and design;Journal of chemical theory and computation,2017
2. Unified rational protein engineering with sequence-based deep representation learning;Nature methods,2019
3. Anand, N. and Achim, T. Protein structure and sequence generation with equivariant denoising diffusion probabilistic models, 2022.
4. Anand, N. and Huang, P. Generative modeling for protein structures. Advances in neural information processing systems, 31, 2018.
5. Anand-Achim, N. , Eguchi, R. R. , Mathews, I. I. , Perez, C. P. , Derry, A. , Altman, R. B. , and Huang, P.-S. Protein sequence design with a learned potential. Biorxiv, pp. 2020–01, 2021.
Cited by
167 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献