Abstract
Protein language models (PLMs) convert amino acid sequences into the numerical representations required to train machine learning (ML) models. Many PLMs are large (>600M parameters) and trained on a broad span of protein sequence space. However, these models are limited in predictive accuracy and computationally costly. Here, we use multiplexed Ancestral Sequence Reconstruction (mASR) to generate small but focused functional protein sequence datasets for PLM training. Compared to large PLMs, this local ancestral sequence embedding (LASE) produces representations 10-fold faster and with higher predictive accuracy. We show that, owing to the evolutionary nature of the ASR data, LASE produces smoother fitness landscapes in which protein variants that are closer in fitness value are also numerically closer in representation space. This work contributes to the implementation of ML-based protein design in real-world settings, where data is sparse and computational resources are limited.
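The smoothness claim can be made concrete: if the landscape is smooth, pairwise distances between variant embeddings should correlate with pairwise differences in fitness. The sketch below, which is not the paper's code, illustrates one way to test this; the `embeddings` and `fitness` arrays are hypothetical placeholders standing in for LASE representations and measured fitness values.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# quantify landscape "smoothness" by checking whether variants with
# similar fitness also sit close together in representation space.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 128))  # 50 variants x 128-dim representations (placeholder)
fitness = rng.normal(size=50)            # one fitness value per variant (placeholder)

# Pairwise Euclidean distances in representation space...
emb_dist = pdist(embeddings, metric="euclidean")
# ...and pairwise absolute differences in fitness value.
fit_dist = pdist(fitness[:, None], metric="cityblock")

# On a smoother landscape, small embedding distances should coincide
# with small fitness differences, i.e. a positive rank correlation.
rho, p = spearmanr(emb_dist, fit_dist)
print(f"Spearman rho = {rho:.3f} (p = {p:.2g})")
```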
Publisher
Cold Spring Harbor Laboratory
Cited by
2 articles.