Authors:
Zhang Xiang, Yang Mingjie, Yin Xunhang, Qian Yining, Sun Fei
Abstract
Decoding the language of DNA sequences is a fundamental problem in genome research. Mainstream pre-trained models such as DNABERT-2 and Nucleotide Transformer have achieved strong results across a spectrum of DNA analysis tasks. Yet these models still face three pivotal challenges: (1) genetic language diversity, or the capability of a foundation model to capture genetic variation across individuals and populations; (2) model efficiency, specifically how to improve performance at a cost that scales to large genetic foundation models; and (3) length extrapolation, or the ability to accurately interpret sequences ranging from short to long within a unified model framework. In response, we introduce DeepGene, a model that leverages Pan-genome and Minigraph representations to encompass the broad diversity of genetic language. DeepGene employs rotary position embedding to improve length extrapolation in various genetic analysis tasks. On the 28 tasks in Genome Understanding Evaluation, DeepGene ranks first on 9 tasks and second on 5, and achieves the best overall score. DeepGene outperforms other cutting-edge models in compactness and in efficiency when processing sequences of varying lengths. The datasets and source code of DeepGene are available on GitHub (https://github.com/wds-seu/DeepGene).
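The rotary position embedding mentioned in the abstract can be illustrated with a minimal sketch. This is the generic textbook formulation, not code taken from DeepGene; the function names `rope` and `dot`, the base of 10000, and the toy vectors are assumptions made for illustration only. The sketch shows the property usually cited in connection with length extrapolation: the attention score between a rotated query and a rotated key depends only on their relative offset, not on their absolute positions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`.

    Generic rotary position embedding (illustrative; not DeepGene's code).
    The pair starting at index i uses frequency base ** (-i / d).
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        # Standard 2-D rotation of the pair (x, y) by angle theta.
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (arbitrary values, for illustration only).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Same relative offset (2) at very different absolute positions.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 1003), rope(k, 1001))
print(abs(score_near - score_far) < 1e-9)
```

Because each pair is only rotated, the vector norm is unchanged, and the two scores agree up to floating-point error even though the second pair of positions lies far from the first.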
Publisher
Cold Spring Harbor Laboratory