Abstract
Linking DNA sequence to genomic function remains one of the grand challenges in genetics and genomics. Here, we combine large-scale single-molecule transcriptome sequencing of diverse cancer cell lines with cutting-edge machine learning to build LoRNASH, an RNA foundation model that learns how the nucleotide sequence of unspliced pre-mRNA dictates transcriptome architecture, that is, the relative abundances and molecular structures of mRNA isoforms. Owing to its use of the StripedHyena architecture, LoRNASH handles extremely long sequence inputs (~65 kilobase pairs), allowing for quantitative, zero-shot prediction of all aspects of transcriptome architecture, including isoform abundance, isoform structure, and the impact of DNA sequence variants on transcript structure and abundance. We anticipate that our public data release and proof-of-concept model will accelerate various aspects of RNA biotechnology. More broadly, we envision the use of LoRNASH as a foundation for fine-tuning on any transcriptome-related downstream prediction task, including cell-type-specific gene expression, splicing, and general RNA processing.
Publisher
Cold Spring Harbor Laboratory