Evaluating the representational power of pre-trained DNA language models for regulatory genomics

Published: 2024-03-04

Author:

Tang Ziqi, Koo Peter K

Abstract

The emergence of genomic language models (gLMs) offers an unsupervised approach to learn a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown pre-trained gLMs can be leveraged to improve prediction performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that current gLMs do not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major limitation with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.

Publisher

Cold Spring Harbor Laboratory

Cited by 2 articles.

1. Unlocking gene regulation with sequence-to-function models;Nature Methods;2024-08

2. Interpretably deep learning amyloid nucleation by massive experimental quantification of random sequences;2024-07-17