Authors:
Chen Bo, Cheng Xingyi, Li Pan, Geng Yangli-ao, Gong Jing, Li Shen, Bei Zhilei, Tan Xu, Wang Boyan, Zeng Xin, Liu Chiming, Zeng Aohan, Dong Yuxiao, Tang Jie, Song Le
Abstract
Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that 1) xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced 3D structural prediction model that surpasses existing language model-based tools. 2) xTrimoPGLM can not only generate de novo protein sequences following the principles of natural ones, but also perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.
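The abstract's central technical claim is that one backbone can be jointly optimized under an autoencoding (span-infilling) objective and an autoregressive (next-token) objective. The PyTorch sketch below illustrates one way such a mixed objective could look; it is not the paper's recipe. All identifiers (corrupt_spans, unified_loss, infill_ratio, the token ids, the sampling ratio) are hypothetical, and real GLM-style blank infilling regenerates the masked span autoregressively rather than position-wise as simplified here.

```python
import torch
import torch.nn.functional as F

PAD_ID = 0   # assumed padding token id (ignored by the loss)
MASK_ID = 1  # assumed [MASK] token id used for span corruption


def corrupt_spans(tokens, mask_frac=0.15):
    """Mask one random contiguous span per sequence (a simplification of
    GLM-style blank infilling). Targets hold the original ids at masked
    positions and PAD_ID elsewhere, so the loss covers only the span."""
    B, T = tokens.shape
    inputs = tokens.clone()
    targets = torch.full_like(tokens, PAD_ID)
    span = max(1, int(T * mask_frac))
    for b in range(B):
        start = int(torch.randint(0, T - span + 1, (1,)))
        targets[b, start:start + span] = tokens[b, start:start + span]
        inputs[b, start:start + span] = MASK_ID
    return inputs, targets


def unified_loss(model, tokens, infill_ratio=0.5):
    """Per batch, sample either the span-infilling (understanding-oriented)
    or the next-token (generation-oriented) objective on one shared model."""
    if torch.rand(()).item() < infill_ratio:
        inputs, targets = corrupt_spans(tokens)          # autoencoding-like
    else:
        inputs, targets = tokens[:, :-1], tokens[:, 1:]  # autoregressive
    logits = model(inputs)                               # (B, T', vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=PAD_ID,
    )


# Toy usage: any module mapping token ids to per-position logits works here.
vocab = 33  # ~20 amino acids plus special tokens (an assumed size)
model = torch.nn.Sequential(torch.nn.Embedding(vocab, 64),
                            torch.nn.Linear(64, vocab))
batch = torch.randint(2, vocab, (4, 128))  # ids from 2 up avoid PAD/MASK
unified_loss(model, batch).backward()
```

Sampling the objective per batch keeps a single model, optimizer, and loss head, which is the compatibility property the abstract highlights: both tasks share all parameters and gradients rather than requiring separate encoder and decoder models.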
Publisher
Cold Spring Harbor Laboratory
Cited by: 29 articles.