Authors:
Zhang Daoan, Zhang Weitong, Zhao Yu, Zhang Jianguo, He Bing, Qin Chenchen, Yao Jianhua
Abstract
Pre-trained large language models demonstrate potential in extracting information from DNA sequences, yet adapting to a variety of tasks and data modalities remains a challenge. To address this, we propose DNAGPT, a generalized DNA pre-training model trained on over 200 billion base pairs from all mammals. By enhancing the classic GPT model with a binary classification task (DNA sequence order), a numerical regression task (guanine-cytosine content prediction), and a comprehensive token language, DNAGPT can handle versatile DNA analysis tasks while processing both sequence and numerical data. Our evaluation of genomic signal and region recognition, mRNA abundance regression, and artificial genome generation tasks demonstrates DNAGPT’s superior performance compared to existing models designed for specific downstream tasks, benefiting from pre-training using the newly designed model structure.
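One of the pre-training targets the abstract names is guanine-cytosine (GC) content regression. As a minimal sketch of what that target computes (not the DNAGPT implementation itself, which predicts this value from tokenized sequences), the label for a sequence can be derived as follows:

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in a DNA sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)

# Example: 4 of 6 bases are G or C.
print(round(gc_content("ATGCGC"), 3))  # 0.667
```

The regression head in the model is trained to predict this scalar for each input sequence; the helper above only illustrates how the ground-truth value is defined.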
Publisher
Cold Spring Harbor Laboratory
References (53 articles; first 5 shown):
1. Mechanism of action of penicillins: a proposal based on their structural similarity to acyl-D-alanyl-D-alanine.
2. Structure of transfer RNAs: similarity and variability. Wiley Interdisciplinary Reviews: RNA (2012)
3. Chen, A., Sun, Y., Lei, Y., Li, C., Liao, S., Meng, J., Bai, Y., Liu, Z., Liang, Z., Zhu, Z., et al.: Single-cell spatial transcriptome reveals cell-type organization in the macaque cortex. Cell (2023)
4. Theranostic cells: emerging clinical applications of synthetic biology
5. Next-generation DNA sequencing
Cited by: 6 articles.