Authors:
Yaiza Serrano, Sergi Roda, Victor Guallar, Alexis Molina
Abstract
Large Language Models (LLMs) have demonstrated exceptional capabilities in understanding contextual relationships, outperforming traditional methodologies in downstream tasks such as text generation and sentence classification. This success has been mirrored in the realm of protein language models (pLMs), where proteins are encoded as text via their amino acid sequences. However, the training of pLMs, which involves tens to hundreds of millions of sequences and hundreds of millions to billions of parameters, poses a significant computational challenge. In this study, we introduce a Small-Scale Protein Language Model (SS-pLM), a more accessible approach that requires training on merely millions of representative sequences and reduces the number of trainable parameters to 14.8M. This model significantly lowers the computational load, thereby democratizing the use of foundational models in protein studies. We demonstrate that the performance of our model, when fine-tuned to a specific set of sequences for generation, is comparable to that of larger, more computationally demanding pLMs.
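For intuition on what a pLM in the ~15M-parameter range looks like, the sketch below instantiates a GPT-style decoder-only model over an amino-acid vocabulary using Hugging Face Transformers. The hidden size, layer count, sequence length, and vocabulary are illustrative assumptions chosen to land near the abstract's 14.8M trainable parameters; they are not the configuration reported in the paper.

```python
# Hypothetical sketch of a small-scale protein language model (SS-pLM-like).
# All hyperparameters below are assumptions for illustration only.
from transformers import GPT2Config, GPT2LMHeadModel

# Amino-acid-level vocabulary: 20 standard residues plus a few special tokens
# (pad, bos, eos, unk). The actual tokenizer used by the authors may differ.
VOCAB_SIZE = 32
MAX_LEN = 1024  # assumed maximum protein sequence length

config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_positions=MAX_LEN,
    n_embd=320,   # hidden size (illustrative)
    n_layer=12,   # number of transformer blocks (illustrative)
    n_head=8,     # attention heads (illustrative)
)

model = GPT2LMHeadModel(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {n_params / 1e6:.1f}M")  # roughly 15M with these settings
```

After pre-training on a representative sequence set and fine-tuning on a target family, such a causal model can generate new sequences autoregressively (e.g., via `model.generate` from a start token), which is the generation setting the abstract evaluates against larger pLMs.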
Publisher
Cold Spring Harbor Laboratory
Cited by
3 articles.