Authors:
Zheng Zaixiang, Deng Yifan, Xue Dongyu, Zhou Yi, Ye Fei, Gu Quanquan
Abstract
This paper demonstrates that language models are strong structure-based protein designers. We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs), which have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, to acquire an immediate capability to design preferable protein sequences for given folds. We conduct a structural surgery on pLMs, in which a lightweight structural adapter is implanted into the pLMs and endows them with structural awareness. During inference, iterative refinement is performed to effectively optimize the generated protein sequences. Experiments show that LM-Design improves the state-of-the-art results by a large margin, leading to 4% to 12% accuracy gains in sequence recovery (e.g., 55.65%/56.63% on CATH 4.2/4.3 single-chain benchmarks, and >60% when designing protein complexes). We provide extensive and in-depth analyses, which verify that LM-Design can (1) indeed leverage both structural and sequential knowledge to accurately handle structurally non-deterministic regions, (2) benefit from scaling data and model size, and (3) generalize to other proteins (e.g., antibodies and de novo proteins).
Publisher
Cold Spring Harbor Laboratory