Abstract
Complex deep learning models trained on very large datasets have become key enabling tools for current research in natural language processing and computer vision. By providing pre-trained models that can be fine-tuned for specific applications, they enable researchers to create accurate models with minimal effort and computational resources. Large-scale genomics deep learning models come in two flavors: the first are large language models of DNA sequences trained in a self-supervised fashion, similar to the corresponding natural language models; the second are supervised learning models that leverage large-scale genomics datasets from ENCODE and other sources. We argue that these models are the equivalent of foundation models in natural language processing in their utility: they encode chromatin state in its different aspects, providing useful representations that allow quick deployment of accurate models of gene regulation. We demonstrate this premise by leveraging the recently created Sei model to develop simple, interpretable models of intron retention, and show their advantage over models based on the DNA language model DNABERT-2. Our work also demonstrates the impact of chromatin state on the regulation of intron retention: using representations learned by Sei, our model is able to discover the involvement of transcription factors and chromatin marks in regulating intron retention, while providing better accuracy than a recently published custom model developed for this purpose.
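To illustrate the kind of pipeline the abstract describes, the sketch below fits a sparse logistic-regression classifier of intron retention on top of chromatin-profile scores such as those Sei produces for a sequence. The feature matrix, its dimensions, and the labels are simulated placeholders rather than the paper's data or the authors' actual model; the point is only that a linear model over such representations remains interpretable, since each nonzero coefficient maps back to a named chromatin profile (e.g., a transcription factor or histone mark).

```python
# Hypothetical sketch (not the authors' code): a simple, interpretable
# classifier of intron retention built on precomputed chromatin-profile
# scores. Sei outputs are simulated with random numbers here; in practice
# they would be Sei's predicted profiles for sequences around each intron.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

n_introns, n_profiles = 2000, 64                 # stand-in dimensions
X = rng.normal(size=(n_introns, n_profiles))     # simulated Sei-derived features
y = rng.integers(0, 2, size=n_introns)           # retained (1) vs. spliced (0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# An L1 penalty keeps the model sparse: surviving coefficients point to the
# chromatin profiles most associated with intron retention.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_train, y_train)

print("test AUROC:", roc_auc_score(y_test, clf.decision_function(X_test)))
top = np.argsort(-np.abs(clf.coef_[0]))[:10]
print("most influential (simulated) chromatin profiles:", top)
```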
Publisher
Cold Spring Harbor Laboratory