Abstract
Most protein language models (PLMs), which are used to produce high-quality protein representations, use only protein sequences during training. However, known protein structure is crucial for many protein property prediction tasks, so there is growing interest in incorporating structural knowledge into PLMs. In this study, we propose MULAN, a MULtimodal PLM for both sequence and ANgle-based structure encoding. MULAN combines a pre-trained sequence encoder with a newly introduced Structure Adapter, which are fused and trained together. In an evaluation on 7 downstream tasks of varying nature, both small and medium-sized MULAN models show consistent quality improvements over both the sequence-only ESM-2 and the structure-aware SaProt. Importantly, our model offers a cheap way to increase the structural awareness of protein representations, since it finetunes existing PLMs instead of training from scratch. We perform a detailed analysis of the proposed model and demonstrate its awareness of the protein structure. The implementation, training data and model checkpoints are available at https://github.com/DFrolova/MULAN.
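The abstract describes fusing a pre-trained sequence encoder with a Structure Adapter operating on angle-based structure features. Below is a minimal, hedged sketch of how such a fusion could look in PyTorch; the class and function names (StructureAdapter, fuse), the number of angle features (7), and the hidden size (480) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class StructureAdapter(nn.Module):
    """Illustrative adapter: projects per-residue dihedral-angle features
    into the hidden space of a pre-trained sequence encoder.
    Sizes are assumptions, not the paper's actual configuration."""

    def __init__(self, num_angles: int = 7, hidden_dim: int = 480):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(num_angles, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, angles: torch.Tensor) -> torch.Tensor:
        # angles: (batch, length, num_angles) -> (batch, length, hidden_dim)
        return self.proj(angles)


def fuse(sequence_embeddings: torch.Tensor,
         angles: torch.Tensor,
         adapter: StructureAdapter) -> torch.Tensor:
    """One plausible fusion scheme: add the structure projection to the
    per-residue sequence embeddings before the transformer layers of the
    (finetuned) base PLM process them."""
    return sequence_embeddings + adapter(angles)


# Toy usage with random tensors standing in for real encoder outputs.
adapter = StructureAdapter()
seq_emb = torch.randn(2, 100, 480)   # per-residue embeddings from a sequence encoder
angles = torch.randn(2, 100, 7)      # per-residue angle features from the structure
fused = fuse(seq_emb, angles, adapter)
```

Because the adapter is small and the base encoder is reused, this kind of design only requires finetuning rather than pre-training from scratch, which matches the abstract's claim of a cheap increase in structural awareness.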
Publisher
Cold Spring Harbor Laboratory