Abstract
Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. Recent progress has focused primarily on parameter count, with model capacities now surpassing the size of the very datasets they were trained on. Here, we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of tasks. On some tasks, such as species recognition, prediction of protein and transcript abundance, and melting point estimation, a language model trained on codons outperforms every other published protein language model, including some with over 50 times more parameters. These results suggest that, beyond the commonly studied axes of scale and model complexity, the information content of biological data offers an orthogonal direction for improving the power of machine learning in biology.
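To make the codon-versus-amino-acid distinction concrete, the following minimal Python sketch contrasts the two tokenization schemes. It is illustrative only and not the paper's implementation: the coding sequence, the truncated codon table, and the helper names are hypothetical assumptions introduced here.

# Minimal sketch (not the paper's method): the same coding sequence
# tokenized at the codon level versus the amino-acid level.
# CODON_TABLE is deliberately truncated to the codons used below.
CODON_TABLE = {
    "ATG": "M", "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K", "TAA": "*", "TAG": "*", "TGA": "*",
}

def codon_tokens(cds: str) -> list[str]:
    """Split a coding sequence into codon tokens (the richer vocabulary)."""
    assert len(cds) % 3 == 0, "CDS length must be a multiple of 3"
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

def amino_acid_tokens(cds: str) -> list[str]:
    """Translate each codon to an amino acid (the coarser vocabulary)."""
    return [CODON_TABLE.get(c, "X") for c in codon_tokens(cds)]

cds = "ATGGCTGCCAAATAA"  # hypothetical CDS: Met-Ala-Ala-Lys-Stop
print(codon_tokens(cds))       # ['ATG', 'GCT', 'GCC', 'AAA', 'TAA']
print(amino_acid_tokens(cds))  # ['M', 'A', 'A', 'K', '*']

Note that the synonymous codons GCT and GCC map to the same amino-acid token 'A' but remain distinct codon tokens, so a codon vocabulary preserves information (e.g. codon-usage signals) that amino acid sequences discard; this is the extra information content the abstract refers to.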
Publisher
Cold Spring Harbor Laboratory