Abstract
Large pretrained protein language models (PLMs) have improved protein property and structure prediction from sequences via transfer learning, in which weights and representations from PLMs are repurposed for downstream tasks. Although PLMs have shown great promise, currently there is little understanding of how the features learned by pretraining relate to and are useful for downstream tasks. We perform a systematic analysis of transfer learning using PLMs, conducting 370 experiments across a comprehensive suite of factors including different downstream tasks, architectures, model sizes, model depths, and pretraining time. We observe that while almost all downstream tasks do benefit from pretrained models compared to naive sequence representations, for the majority of tasks performance does not scale with pretraining, and instead relies on low-level features learned early in pretraining. Our results point to a mismatch between current PLM pretraining paradigms and most applications of these models, indicating a need for better pretraining methods.
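The transfer-learning setup described in the abstract can be illustrated with a minimal sketch: embed protein sequences with a frozen pretrained PLM and fit a simple downstream head on those fixed representations. The checkpoint name, mean pooling, and the ridge-regression head below are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of transfer learning from a frozen PLM (assumptions noted above).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import Ridge

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # small ESM-2 checkpoint (assumed choice)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()  # PLM weights stay frozen; only the downstream head is fit

def embed(sequences):
    """Mean-pool final-layer hidden states into one fixed-size vector per sequence."""
    batch = tokenizer(sequences, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**batch)
    hidden = out.last_hidden_state                 # (batch, length, dim)
    mask = batch["attention_mask"].unsqueeze(-1)   # ignore padding positions
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy downstream regression task; sequences and labels are placeholders.
train_seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MGSSHHHHHHSSGLVPRGSH"]
train_y = [0.7, 0.2]
head = Ridge().fit(embed(train_seqs), train_y)
preds = head.predict(embed(["MKVLAAGIVALLAVSQA"]))
```

A usage note: with the PLM frozen, only the lightweight head is trained, which is the "repurposed representations" regime the abstract refers to; fine-tuning the PLM weights themselves is the other common variant.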
Publisher
Cold Spring Harbor Laboratory
Cited by
9 articles.