Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning

Published: 2022-01 Volume: 16 Page: 117793222211183
ISSN: 1177-9322
Container-title: Bioinformatics and Biology Insights
Language: en
Short-container-title: Bioinform Biol Insights

Author:

Jha Tony¹, Mendel Jovinna², Cho Hyuk³, Choudhary Madhusudan²

Affiliation:

1. Department of Mathematics, University of California, Berkeley, Berkeley, CA, USA

2. Department of Biological Sciences, Sam Houston State University, Huntsville, TX, USA

3. Department of Computer Science, Sam Houston State University, Huntsville, TX, USA

Abstract

Small ribonucleic acid (sRNA) sequences are noncoding RNA (ncRNA) sequences, 50–500 nucleotides long, that play an important role in regulating transcription and translation within a bacterial cell. As such, identifying sRNA sequences within an organism’s genome is essential to understanding the impact of these RNA molecules on cellular processes. Recently, numerous machine learning models have been applied to predict sRNAs within bacterial genomes. In this study, we treated sRNA prediction as an imbalanced binary classification problem, in which a minority of positive sRNA instances must be distinguished from a majority of negative ones, and performed a comparative study with six learning algorithms and seven assessment metrics. First, we collected numerical feature groups extracted from known sRNAs previously identified in the Salmonella typhimurium LT2 (SLT2) and Escherichia coli K12 (E. coli K12) genomes. Second, as a preliminary study, we characterized the sRNA size distribution with a conformity test for Benford’s law. Third, we applied six traditional classification algorithms to the sRNA features and assessed classification performance with seven metrics, varying the positive-to-negative instance ratio and using stratified 10-fold cross-validation. We revisited important individual features and feature groups and found that classification with combined features performs better than classification with either an individual feature or a single feature group in terms of the area under the precision-recall curve (AUPR). We reconfirmed that AUPR properly measures classification performance on imbalanced data with varying imbalance ratios, which is consistent with previous studies on classification metrics for imbalanced data. Overall, eXtreme Gradient Boosting (XGBoost), even without optimized hyperparameter values, performed better than the other five algorithms with their specific optimal parameter settings. As future work, we plan to apply XGBoost to a larger set of published sRNAs in bacterial genomes and compare its classification performance with that of recent machine learning models.
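
The evaluation setup named in the abstract (stratified k-fold splitting and AUPR on imbalanced labels) can be illustrated with a minimal sketch. This is not the authors' code: the function names `average_precision` and `stratified_folds`, the toy labels, and the toy scores are assumptions made for illustration. AUPR is approximated here by the step-wise average precision, and no classifier is trained.

```python
from collections import defaultdict


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision).

    y_true: sequence of 0/1 labels (1 = positive, e.g. a known sRNA).
    scores: sequence of classifier scores; higher means "more likely positive".
    Ties in scores keep their input order, which is a simplification.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(label for _, label in ranked)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_pos


def stratified_folds(y_true, k):
    """Assign indices to k folds so each fold keeps the overall class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y_true):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the members of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical imbalanced data set: 10 positives and 90 negatives (ratio 1:9).
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=10)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]

# Hypothetical scores for five instances; positives are ranked 1st and 3rd.
toy_ap = average_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.3, 0.1])
```

With these toy inputs, every fold receives one positive and nine negatives, so the 1:9 ratio is preserved in each fold. Because average precision only accumulates precision at the ranks where positives occur, it stays sensitive to how the minority class is ranked even when negatives dominate the data.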

Funder

NSF

Publisher

SAGE Publications

Subject

Applied Mathematics, Computational Mathematics, Computer Science Applications, Molecular Biology, Biochemistry

Link

http://journals.sagepub.com/doi/pdf/10.1177/11779322221118335

References: 57 articles.