1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
2. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners. OpenAI Blog (2019).
3. J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
4. A text abstraction summary model based on BERT word embedding and reinforcement learning. Applied Sciences (2019).
5. In: Computational Linguistics: 16th International Conference of the Pacific Association for Computational Linguistics, PACLING 2019, Hanoi, Vietnam, October 11–13, 2019, Revised Selected Papers (2020).