GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences

Published: 2023-06-13

Author:

Fishman Veniamin, Kuratov Yuri, Petrov Maxim, Shmelev Aleksei, Shepelin Denis, Chekanov Nikolay, Kardymon Olga, Burtsev Mikhail

Abstract

Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA-LM, a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36,000 base pairs. Notably, integration of the newly developed Recurrent Memory mechanism allows these models to process even larger DNA segments. We provide pre-trained versions of GENA-LM, demonstrating their capability for fine-tuning and addressing a spectrum of complex biological tasks with modest computational demands. While language models have already achieved significant breakthroughs in protein biology, GENA-LM showcases a similarly promising potential for reshaping the landscape of genomics and multi-omics data analysis. All models are publicly available on GitHub (https://github.com/AIRI-Institute/GENALM) and HuggingFace (https://huggingface.co/AIRI-Institute).

Publisher

Cold Spring Harbor Laboratory

References: 60 articles.

1. Deciphering the multi-scale, quantitative cis-regulatory code

2. Navigating the pitfalls of applying machine learning in genomics

3. Machine learning applications in genetics and genomics

4. Quantitative prediction of enhancer–promoter interactions

5. Sindeeva, M., Chekanov, N., Avetisian, M., Shashkova, T.I., Baranov, N., Malkin, E., Lapin, A., Kardymon, O., Fishman, V.: Cell type–specific interpretation of noncoding variants using deep learning–based methods. GigaScience 12, 015 (2023)

Cited by 12 articles.

1. Species-specific design of artificial promoters by transfer-learning based generative deep-learning model; Nucleic Acids Research; 2024-05-23

2. GENA-Web - GENomic Annotations Web Inference using DNA language models; 2024-04-29

3. Species-aware DNA language models capture regulatory elements and their evolution; Genome Biology; 2024-04-02

4. A germline chimeric KANK1-DMRT1 transcript derived from a complex structural variant is associated with a congenital heart defect segregating across five generations; Chromosome Research; 2024-03-19

5. Evaluating the representational power of pre-trained DNA language models for regulatory genomics; 2024-03-04