Are genomic language models all you need? Exploring genomic language models on protein downstream tasks

Published: 2024-08-30 Issue: 9 Volume: 40
ISSN: 1367-4811
Container-title: Bioinformatics
Language: en

Author:

Boshar Sam¹, Trop Evan¹, de Almeida Bernardo P², Copoiu Liviu³, Pierrot Thomas¹

Affiliation:

1. InstaDeep, Cambridge, MA 02142, United States

2. InstaDeep, Paris 75010, France

3. InstaDeep, London W2 1AY, United Kingdom

Abstract

Motivation: Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, because few tasks pair proteins with the coding DNA sequences (CDS) that gLMs can process.

Results: In this work, we curated five such datasets and used them to evaluate the performance of gLMs and proteomic language models (pLMs). We show that gLMs are competitive with their pLM counterparts and even outperform them on some tasks. The best performance was achieved using the retrieved CDS rather than CDS obtained through sampling strategies. We found that training a joint genomic-proteomic model outperforms each individual approach, showing that the two capture different but complementary sequence representations, as we demonstrate through model interpretation of their embeddings. Lastly, we explored different genomic tokenization schemes to improve downstream protein performance. We trained a new Nucleotide Transformer (50M) foundation model with 3mer tokenization that outperforms its 6mer counterpart on protein tasks while maintaining performance on genomics tasks. The application of gLMs to proteomics offers the potential to leverage rich CDS data and, in the spirit of the central dogma, the possibility of a unified and synergistic approach to genomics and proteomics.

Availability and implementation: We make our inference code, 3mer pre-trained model weights and datasets available.
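
To illustrate the difference between the 3mer and 6mer tokenization schemes mentioned in the abstract, the following is a minimal sketch of non-overlapping k-mer tokenization applied to a coding sequence. It is not the authors' tokenizer: the function name `kmer_tokenize`, the single-base fallback for leftover nucleotides, and the toy CDS are assumptions made for illustration only. What it shows is that with k = 3 and an in-frame CDS, each token lines up with one codon (and so with one amino acid), whereas with k = 6 each token spans two codons.

```python
def kmer_tokenize(seq: str, k: int) -> list[str]:
    """Split a nucleotide sequence into non-overlapping k-mers, left to right.

    Assumption for this sketch: bases left over at the end that do not fill
    a whole k-mer are emitted as single-nucleotide tokens.
    """
    seq = seq.upper()
    n_full = len(seq) // k
    tokens = [seq[i * k:(i + 1) * k] for i in range(n_full)]
    tokens.extend(seq[n_full * k:])  # leftover bases, one token each
    return tokens


# Hypothetical toy CDS (Met-Ala-Lys-stop), not taken from the paper's datasets.
cds = "ATGGCCAAGTAA"

three_mer_tokens = kmer_tokenize(cds, 3)  # one token per codon
six_mer_tokens = kmer_tokenize(cds, 6)    # one token per pair of codons

print(three_mer_tokens)
print(six_mer_tokens)
```

In this toy case the 3mer tokens are exactly the codons of the sequence, which is one plausible reason codon-aligned tokens could help on protein tasks; the abstract itself only reports the empirical result.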

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btae529/58972224/btae529.pdf

References (46 articles)

1. Assessment of hard target modeling in CASP12 reveals an emerging role of alignment-based contact prediction methods; Abriata; Proteins, 2018

2. Effective gene expression prediction from sequence by integrating long-range interactions; Avsec; Nat Methods, 2021

3. The Protein Data Bank; Berman; Nucleic Acids Res, 2000