Affiliation:
1. School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, P.R. China
2. Hebei Innovation Center for Smart Perception and Applied Technology of Agricultural Data, Qinhuangdao, P.R. China
3. School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao, P.R. China
Abstract
N4-methylcytosine (4mC) is one of the most important epigenetic modifications; it
plays a significant role in biological processes and helps explain biological functions. Although biological
experiments can identify potential 4mC sites, they are constrained by experimental conditions
and are labor-intensive. It is therefore crucial to construct a computational model for identifying
4mC sites. Several computational methods have been proposed, but
two problems remain: (1) a more accurate algorithm
is required to improve prediction, especially in terms of the Matthews correlation coefficient (MCC);
(2) a simpler method is needed so that clinical research can use it to design medicines or treat disease. With these
aspects in mind, this study proposes an effective algorithm that uses comprehensible encodings across
multiple species. Because the arrangement of nucleotides and their properties can reflect sequence structure
and function, several feature vectors were developed based on nucleotide energy information,
trinucleotide energy information, and nucleotide chemical property information. In addition,
the effect of each feature was analyzed to select the optimal feature vectors for multiple species. Finally, the
optimal feature vectors were fed into the CatBoost algorithm to construct the identification model.
The evaluation results showed that the proposed method achieved the highest MCC, 2.5% to 11.1%,
1.4% to 17.8%, 1.1% to 7.6%, and 2.3% to 18.0% higher than previous models on the A. thaliana, C. elegans,
D. melanogaster, and E. coli datasets, respectively. These results indicate that the
proposed method can identify 4mC sites in multiple species and performs especially well in terms of MCC. It could
provide a useful supplement to biological research.
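As a rough illustration of two ideas mentioned in the abstract, the sketch below encodes a DNA sequence by nucleotide chemical properties and computes MCC from binary predictions. It is not the authors' implementation. The property triplets (ring structure, functional group, hydrogen bonding) follow a commonly used convention and are an assumption here, as are the function names `encode_ncp` and `mcc` and the choice to map unknown bases to zeros. The energy-based features and the CatBoost model are not shown.

```python
import math

# Assumed convention: (ring structure, functional group, hydrogen bonding).
# Purine = 1, amino group = 1, weak hydrogen bonding = 1.
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "T": (0, 0, 1),
}


def encode_ncp(seq):
    """Concatenate the 3-bit chemical-property code of every base in seq."""
    vec = []
    for base in seq.upper():
        # Unknown symbols (e.g. N) map to zeros; this is an illustrative choice.
        vec.extend(NCP.get(base, (0, 0, 0)))
    return vec


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC uses all four cells of the confusion matrix, so it remains informative when the 4mC and non-4mC classes are imbalanced, which is one reason it is often preferred over accuracy for this kind of site prediction.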
Funder
Science Research Project of the Hebei Education Department
Science Research Project of Hebei Innovation Center for Smart Perception and Applied Technology of Agricultural Data
333 Talent Project of Hebei Province
Hebei Graduate Student Innovation Ability Training Funding Project
Publisher
Bentham Science Publishers Ltd.