Abstract
Generative models for protein design trained on experimentally determined structures have proven useful for a variety of design tasks. However, such methods are limited by the quantity and diversity of structures used for training, which represent a small, biased fraction of protein space. Here, we describe proseLM, a method for protein sequence design based on adaptation of protein language models to incorporate structural and functional context. We show that proseLM benefits from the scaling trends of underlying language models, and that the addition of non-protein context – nucleic acids, ligands, and ions – improves recovery of native residues during design by 4-5% across model scales. These improvements are most pronounced for residues that directly interface with non-protein context, which are faithfully recovered at rates >70% by the most capable proseLM models. We experimentally validated proseLM by optimizing the editing efficiency of genome editors in human cells, achieving a 50% increase in base editing activity, and by redesigning therapeutic antibodies, resulting in a PD-1 binder with 2.2 nM affinity.
Publisher
Cold Spring Harbor Laboratory
Reference42 articles.
1. Andrew Leaver-Fay , Michael Tyka , Steven M Lewis , Oliver F Lange , James Thompson , Ron Jacak , Kristian W Kaufman , P Douglas Renfrew , Colin A Smith , Will Sheffler , et al. Rosetta3: an object-oriented software suite for the simulation and design of macromolecules. In Methods in enzymology, volume 487, pages 545–574. Elsevier, 2011.
2. John Ingraham , Vikas Garg , Regina Barzilay , and Tommi Jaakkola . Generative models for graph-based protein design. Advances in neural information processing systems, 32, 2019.
3. Robust deep learning–based protein sequence design using proteinmpnn;Science,2022
4. Chloe Hsu , Robert Verkuil , Jason Liu , Zeming Lin , Brian Hie , Tom Sercu , Adam Lerer , and Alexander Rives . Learning inverse folding from millions of predicted structures. In International Conference on Machine Learning, pages 8946–8970. PMLR, 2022.
5. Hallucinating symmetric protein assemblies;Science,2022