Abstract
DNA 5-methylcytosine (5mC) modification has been widely studied in mammals and plays an important role in epigenetics. Several computational approaches have been developed to aid the identification of methylation sites. In this study, we introduce MR-DNA, a novel deep-learning framework that aims to predict specific methylation sites located in gene promoter regions. The idea is to adapt the named-entity recognition approach to methylation-site prediction. MR-DNA is trained on a stacked model architecture that consists of a pre-trained MuLan-Methyl-DistilBERT language model and conditional random field algorithms. The resulting fine-tuned model achieves an accuracy of 95.4% on an independent test dataset. A key advantage of this formulation of the methylation-site identification task is that the input DNA sequence can be of any length, unlike previous methods that predict methylation state on short, fixed-length DNA sequences. For training and testing purposes, we provide a database of DNA sequences containing verified 5mC-methylation sites, obtained from eight human cell lines in ENCODE. Data and code are available at https://github.com/husonlab/MR-DNA.
CCS Concepts
Computing methodologies → Information extraction.
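To make the named-entity-recognition framing concrete, the following minimal sketch shows how a DNA sequence of arbitrary length could be turned into per-position labels, and how predicted labels map back to site coordinates. It is an illustration only, not the MR-DNA implementation: the function names, the tag names `"5mC"` and `"O"`, and the use of single bases as tokens are all assumptions made here (the actual model may tokenize differently).

```python
# Illustrative sketch only: tag names and per-base tokens are assumptions,
# not taken from the MR-DNA code base.

def label_sequence(seq, methylated_positions):
    """Return one tag per base: '5mC' for listed cytosines, 'O' otherwise."""
    sites = set(methylated_positions)
    tags = []
    for i, base in enumerate(seq.upper()):
        if i in sites:
            if base != "C":
                raise ValueError(f"position {i} is {base!r}, not a cytosine")
            tags.append("5mC")
        else:
            tags.append("O")
    return tags


def extract_sites(tags):
    """Recover 0-based site positions from a tag sequence."""
    return [i for i, tag in enumerate(tags) if tag == "5mC"]


seq = "ACGTCCGA"                     # any length is allowed
tags = label_sequence(seq, [1, 5])   # hypothetical verified 5mC positions
sites = extract_sites(tags)
```

Because every position receives its own label, the output has the same length as the input, which is what allows this formulation to handle sequences of any length rather than fixed-length windows.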
Publisher
Cold Spring Harbor Laboratory