DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Authors:

Yanrong Ji1, Zhihan Zhou2, Han Liu2, Ramana V. Davuluri3

Affiliations:

1. Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA

2. Department of Computer Science, Northwestern University, Evanston, IL 60208, USA

3. Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY 11794, USA

Abstract

Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios.

Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, that captures a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT with the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites after simple fine-tuning with small task-specific labeled datasets. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences, improving interpretability and allowing accurate identification of conserved sequence motifs and candidate functional genetic variants. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks.

Availability and implementation: The source code and the pre-trained and fine-tuned models for DNABERT are available on GitHub (https://github.com/jerryji1993/DNABERT).

Supplementary information: Supplementary data are available at Bioinformatics online.
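The workflow summarized above, pre-train once on the human genome and then fine-tune with a small task-specific labeled set, maps onto the standard BERT fine-tuning interface. The following is a minimal sketch, not the authors' released pipeline: the Hugging Face model identifier, the 6-mer tokenization helper and the toy labels are assumptions made for illustration; the official pre-training and fine-tuning scripts are in the GitHub repository listed under Availability.

    # Minimal sketch: fine-tuning a DNABERT-style model for binary sequence
    # classification (e.g., promoter vs. non-promoter) with Hugging Face
    # Transformers. The model id "zhihan1996/DNA_bert_6" and the overlapping
    # 6-mer tokenization follow the DNABERT convention but are assumptions
    # here; see the official repository for the authors' scripts.
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    def to_kmers(seq: str, k: int = 6) -> str:
        """Convert a raw DNA sequence into space-separated overlapping k-mers."""
        return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

    tokenizer = BertTokenizer.from_pretrained("zhihan1996/DNA_bert_6")
    model = BertForSequenceClassification.from_pretrained(
        "zhihan1996/DNA_bert_6", num_labels=2
    )

    # Tiny illustrative batch with hypothetical labels (1 = positive class).
    sequences = ["ATGCGTACCGGTTAGCATCGATCG", "TTTTAAAACCCCGGGGATATCGCG"]
    labels = torch.tensor([1, 0])

    inputs = tokenizer([to_kmers(s) for s in sequences],
                       return_tensors="pt", padding=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # One fine-tuning step: the classification head on the [CLS] token is
    # trained with cross-entropy while the pre-trained encoder is updated.
    model.train()
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()

In practice the same loop would run over a full task-specific dataset for a few epochs, which is the "easy fine-tuning using small task-specific labeled data" described in the abstract.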

Funder

National Library of Medicine/National Institutes of Health

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability
