Abstract
Designing novel proteins tailored for specific purposes is a promising approach to various biomedical challenges, including drug discovery and vaccine design. Protein language models (ProtLMs) that use control tags as a prefix, or that are fine-tuned on a specific domain, have achieved breakthroughs in controllable protein design. However, the vocabulary of protein sequences contains only 20 amino acid residues, so, unlike a natural language vocabulary, it cannot be used to compose flexible control tags. Moreover, because ProtLMs have a large number of parameters, fine-tuning them with limited data is challenging. In this study, we propose a flexible and controllable protein design method, named PrefixProt, which employs prefix-tuning to learn a virtual token for each protein property on the corresponding dataset. The learned virtual tokens can then be used to prompt pre-trained ProtLMs to generate proteins with tailored structures and functions. We trained two prefix virtual tokens, one on an alpha-helix structure dataset and one on an antimicrobial peptide (AMP) dataset. Our results demonstrate that prefix virtual tokens prompt the pre-trained ProtLM efficiently: by optimizing fewer trainable parameters, they achieve superior results compared with fine-tuning, even in low-data settings. Furthermore, the two prefix virtual tokens can be combined to precisely control the generation of proteins with both AMP function and alpha-helix structure. These results demonstrate that prefix virtual tokens can be flexibly learned and integrated to control protein generation. PrefixProt therefore has the advantages of both control tags and fine-tuning. In summary, PrefixProt offers a flexible and controllable protein design solution. We anticipate that PrefixProt will contribute to protein discovery and biomedical advancement.

Availability and implementation
The models and associated code are available at: https://github.com/chen-bioinfo/PrefixProt
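The idea of learning a small prefix per property and combining prefixes at generation time can be illustrated with a toy sketch. This is not the PrefixProt implementation: the function names, the hidden size, the number of virtual tokens, and the choice to combine prefixes by simple concatenation ahead of the sequence embeddings are all assumptions made for illustration. Real prefix-tuning typically also injects prefix vectors into every attention layer of the frozen model.

```python
import random


def make_prefix(num_virtual_tokens, hidden_size, seed):
    """Create one prefix: a small matrix with one vector per virtual token.

    In prefix-tuning these vectors are the only trainable parameters;
    here they are just randomly initialised (no training is performed).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
            for _ in range(num_virtual_tokens)]


def prepend_prefixes(prefixes, residue_embeddings):
    """Concatenate one or more prefixes ahead of the residue embeddings.

    Concatenation is an assumed combination rule, chosen for simplicity.
    """
    combined = []
    for prefix in prefixes:
        combined.extend(prefix)
    return combined + residue_embeddings


def count_parameters(matrix):
    """Number of scalar values in a list-of-lists matrix."""
    return sum(len(row) for row in matrix)


HIDDEN_SIZE = 8  # hypothetical; real ProtLMs are far wider

# One prefix per property (hypothetical sizes).
helix_prefix = make_prefix(num_virtual_tokens=4, hidden_size=HIDDEN_SIZE, seed=0)
amp_prefix = make_prefix(num_virtual_tokens=4, hidden_size=HIDDEN_SIZE, seed=1)

# Stand-in for the frozen model's embeddings of a 5-residue sequence.
sequence_embeddings = [[0.0] * HIDDEN_SIZE for _ in range(5)]

# Prompt the (frozen) model with both properties at once.
model_input = prepend_prefixes([amp_prefix, helix_prefix], sequence_embeddings)

# Only the prefix values would be optimised; the model weights stay fixed.
trainable = count_parameters(helix_prefix) + count_parameters(amp_prefix)
print(len(model_input), trainable)
```

In this sketch the two prefixes together hold only a few dozen values, which mirrors the abstract's point that far fewer parameters are optimised than in full fine-tuning.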
Publisher
Cold Spring Harbor Laboratory