Abstract
Generative pre-trained transformers (GPTs) have revolutionized the field of natural language processing. Inspired by this success, we develop a long-context generative model for genomes. Our multiscale transformer model was pre-trained on unannotated bacteriophage genomes with byte-level tokenization. It generates de novo sequences of up to 96K base pairs with functional genomic structure, including regulatory elements and novel proteins with phage-related functions. Our work paves the way for the de novo design of whole genomes.
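Because the model operates directly on unannotated nucleotide sequences, byte-level tokenization reduces to mapping each base to a single integer token. The following is a minimal sketch of that idea; the vocabulary, function names, and handling of ambiguous bases are illustrative assumptions, not taken from the paper's released code.

# Minimal sketch of byte-level (per-base) tokenization for DNA sequences.
# The vocabulary and special handling of "N" below are assumptions for
# illustration only.

def tokenize(sequence, vocab=None):
    """Map each nucleotide to a single integer token (one token per base)."""
    if vocab is None:
        vocab = {"A": 1, "C": 2, "G": 3, "T": 4, "N": 5}  # hypothetical vocabulary; 0 reserved for padding
    return [vocab.get(base, vocab["N"]) for base in sequence.upper()]

def detokenize(tokens, inv_vocab=None):
    """Map integer tokens back to a nucleotide string."""
    if inv_vocab is None:
        inv_vocab = {1: "A", 2: "C", 3: "G", 4: "T", 5: "N"}
    return "".join(inv_vocab.get(t, "N") for t in tokens)

if __name__ == "__main__":
    seq = "ATGCGTANNC"
    ids = tokenize(seq)
    print(ids)              # [1, 4, 3, 2, 3, 4, 1, 5, 5, 2]
    print(detokenize(ids))  # ATGCGTANNC

Under this scheme a 96K base-pair genome corresponds to roughly 96,000 tokens, which is why a long-context, multiscale transformer is needed to model whole phage genomes.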
Publisher
Cold Spring Harbor Laboratory