Accurate prediction of DNA N4-methylcytosine sites via boost-learning various types of sequence features

Published:2020-09-11 Issue:1 Volume:21 Page:
ISSN:1471-2164
Container-title:BMC Genomics
language:en
Short-container-title:BMC Genomics

Author:

Zhao Zhixun,Zhang Xiaocai,Chen Fang,Fang Liang,Li Jinyan

Abstract

Background: DNA N4-methylcytosine (4mC) is a critical epigenetic modification and has various roles in the restriction-modification system. Due to the high cost of experimental laboratory detection, computational methods using sequence characteristics and machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because they lack effective sequence features and choose learning algorithms in an ad hoc way. This paper proposes a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem.

Results: The feature importance score distributions in datasets of six species are first reported and analyzed. The impact of feature selection on model performance is then evaluated by independent testing on benchmark datasets, where ACC and MCC after feature selection increase by 2.3% to 9.7% and by 0.05 to 0.19, respectively. The proposed method is compared with three state-of-the-art predictors using an independent test and 10-fold cross-validation, and it outperforms them on all datasets, improving ACC by 3.02% to 7.89% and MCC by 0.06 to 0.15 in the independent test. Two detailed case studies confirmed the overall performance of the proposed method, which correctly identified 24 of 26 4mC sites from the C. elegans gene and 126 of 137 4mC sites from the D. melanogaster gene.

Conclusions: The results show that the proposed feature space and learning algorithm with feature selection can improve the performance of DNA 4mC prediction on the benchmark datasets. The two case studies demonstrate the effectiveness of the method in practical situations.
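The abstract reports results as ACC (accuracy) and MCC (Matthews correlation coefficient) and as counts of correctly identified sites. As a minimal sketch, assuming the standard confusion-matrix definitions of these metrics (the paper's own evaluation code is not shown here), they can be computed as follows. The function names and the example counts in the first two calls are hypothetical and chosen only for illustration; the last two calls use the case-study counts stated in the abstract.

```python
import math


def accuracy(tp, tn, fp, fn):
    """ACC: fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1].

    Returns 0.0 when the denominator is zero (a common convention,
    assumed here rather than taken from the paper).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def recall(tp, fn):
    """Fraction of true sites that were identified."""
    return tp / (tp + fn)


# Hypothetical confusion-matrix counts, for illustration only.
print(accuracy(tp=80, tn=70, fp=30, fn=20))
print(mcc(tp=80, tn=70, fp=30, fn=20))

# Case-study counts from the abstract: 24 of 26 and 126 of 137 sites found.
print(round(recall(tp=24, fn=2), 3))
print(round(recall(tp=126, fn=11), 3))
```

Under these definitions, the two case studies correspond to recovering roughly 92% of the known 4mC sites in each gene. MCC is often reported alongside ACC because it accounts for all four confusion-matrix cells and is less easily inflated by class imbalance.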

Publisher

Springer Science and Business Media LLC

Subject

Genetics,Biotechnology

Link

https://link.springer.com/content/pdf/10.1186/s12864-020-07033-8.pdf

References: 44 articles.

1. Rathi P, Maurer S, Summerer D. Selective recognition of N4-methylcytosine in DNA by engineered transcription-activator-like effectors. Philos Trans R Soc B Biol Sci. 2018; 373(1748):20170078.

2. Stoiber MH, Quick J, Egan R, Lee JE, Celniker SE, Neely R, Loman N, Pennacchio L, Brown JB. De novo identification of DNA modifications enabled by genome-guided nanopore signal processing. BioRxiv. 2016:094672.

3. Chen K, Zhao BS, He C. Nucleic acid modifications in regulation of gene expression. Cell Chem Biol. 2016; 23(1):74–85.

4. Davis BM, Chao MC, Waldor MK. Entering the era of bacterial epigenomics with single molecule real time DNA sequencing. Curr Opin Microbiol. 2013; 16(2):192–8.

5. Korlach J, Turner SW. Going beyond five bases in DNA sequencing. Curr Opin Struct Biol. 2012; 22(3):251–61.

Cited by 24 articles.

1. Multi-scale DNA language model improves 6 mA binding sites prediction;Computational Biology and Chemistry;2024-10

2. Using a hybrid neural network architecture for DNA sequence representation: A study on N4-methylcytosine sites;Computers in Biology and Medicine;2024-08

3. DeepSF-4mC: A deep learning model for predicting DNA cytosine 4mC methylation sites leveraging sequence features;Computers in Biology and Medicine;2024-03

4. 4mC-CGRU: Identification of N4-Methylcytosine (4mC) sites using convolution gated recurrent unit in Rosaceae genome;Computational Biology and Chemistry;2023-12

5. Comparative evaluation and analysis of DNA N4-methylcytosine methylation sites using deep learning;Frontiers in Genetics;2023-08-21