Author:
Cheng Lei, Yu Tong, Aittokallio Tero, Corander Jukka, Khalitov Ruslan, Yang Zhirong
Abstract
Due to their intrinsic properties, DNA molecules commonly exhibit long-range interactions along a linear sequence representation. Taking this information into account when modeling DNA sequences is therefore important for obtaining more accurate sequence-based inference. Many deep learning methods have recently been developed for this purpose, but they still suffer from two major issues. First, existing methods can only handle short DNA fragments, thereby losing longer-range interactions. Second, current methods require massive amounts of supervised labels while missing most of the order information within the sequences. Consequently, there is a need for an efficient deep neural network modeling framework that extracts wide contextual information for more accurate sequence-based inference tasks. Our new framework, named Revolution, takes full DNA sequences as input, without any condensation, and can give accurate predictions for DNA sequences up to 10 kbp. In variant effect prediction, our method increases the Area Under the Receiver Operating Characteristic curve (AUROC) by 19.61% on average across 49 human tissues. Revolution is also shown to work on plant sequences, improving AUROC by 2.36% on average for predicting open chromatin regions (OCRs). The data, models, and code can be freely accessed at https://github.com/wiedersehne/Revolution-DNAPretraining.
Publisher
Cold Spring Harbor Laboratory