Affiliation:
1. Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Research Institute, New Delhi, India
2. ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
3. ICAR-National Bureau of Plant Genetic Resources, Pusa, New Delhi, 110012, India
Abstract
Aim:
The study aimed to develop a robust and more precise 6mA methylation prediction tool
that assists researchers in studying the epigenetic behaviour of crop plants.
Background:
N6-methyladenine (6mA) is one of the predominant epigenetic modifications involved in
a variety of biological processes in all three kingdoms of life. While in vitro approaches are more precise
in detecting epigenetic alterations, they are resource-intensive and time-consuming. Artificial intelligence-based
in silico methods have helped overcome these bottlenecks.
Methods:
A novel machine learning framework was developed by combining four techniques:
ensemble machine learning, a hybrid approach to feature selection, additional features such
as the Average Mutual Information Profile (AMIP), and bootstrap sampling. In this study, four different feature
sets, namely di-nucleotide frequency, GC content, AMIP, and nucleotide chemical properties, were
chosen for the vectorization of DNA sequences. Nine machine learning models, including support vector
machine, random forest, k-nearest neighbor, artificial neural network, multiple logistic regression,
decision tree, naïve Bayes, AdaBoost, and gradient boosting, were trained on the relevant features extracted
through the feature selection module. The three best-performing models were selected and combined into a
robust ensemble model to predict sequences with 6mA sites.
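As a rough illustration of the vectorization and ensemble steps described above, the sketch below computes two of the four feature sets (di-nucleotide frequency and GC content) and combines the labels of three models. It is not the EpiSemble implementation: the function names are hypothetical, AMIP and the nucleotide chemical properties are omitted, and simple majority voting is an assumption, since the abstract does not state how the three selected models are combined.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"
# All 16 di-nucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(pair) for pair in product(NUCLEOTIDES, repeat=2)]


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def dinucleotide_frequency(seq):
    """Relative frequency of each of the 16 overlapping di-nucleotides."""
    seq = seq.upper()
    total = len(seq) - 1  # number of overlapping windows of width 2
    if total <= 0:
        return [0.0] * len(DINUCLEOTIDES)
    counts = Counter(seq[i:i + 2] for i in range(total))
    return [counts[d] / total for d in DINUCLEOTIDES]


def vectorize(seq):
    """Hypothetical helper: 16 di-nucleotide frequencies followed by GC content."""
    return dinucleotide_frequency(seq) + [gc_content(seq)]


def majority_vote(predictions):
    """Combine 0/1 labels from several models by majority vote (assumed scheme).

    predictions: one list of labels per model, all of equal length.
    """
    n_models = len(predictions)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]


if __name__ == "__main__":
    features = vectorize("ACGT")
    print(len(features))
    # Labels from three hypothetical models for three sequences
    print(majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]]))
```

In practice, the resulting vectors would pass through the feature selection step before model training, and the three voters would be whichever models performed best on the data at hand.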
Results:
EpiSemble, a novel ensemble model, was developed for the prediction of 6mA methylation
sites. The new model improved accuracy over existing models by 7.0%, 3.74%, and 6.65%
for the RiceChen, RiceLv, and Arabidopsis datasets, respectively. An R package, EpiSemble,
based on the new model was developed and made available at
https://cran.r-project.org/web/packages/EpiSemble/index.html.
Conclusion:
The EpiSemble model adds AMIP as a novel feature and integrates feature selection modules,
bootstrapping of samples, and an ensemble technique to predict 6mA sites in plants more
accurately. To our knowledge, this is the first R package developed for predicting
epigenetic sites in the genomes of crop plants, and it is expected to help plant researchers in their future explorations.
Funder
ICAR-National Fellow Project on PGR Informatics
Publisher
Bentham Science Publishers Ltd.
Subject
Computational Mathematics, Genetics, Molecular Biology, Biochemistry
Cited by
6 articles.