Comparison of machine learning and deep learning techniques in promoter prediction across diverse species

Author:

Bhandari Nikita (1), Khare Satyajeet (2), Walambe Rahee (3, 4), Kotecha Ketan (1, 3)

Affiliation:

1. Computer Science, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, MH, India

2. Symbiosis School of Biological Sciences, Symbiosis International (Deemed University), Pune, MH, India

3. Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Pune, Maharashtra, India

4. Electronics and Telecommunication Dept, Symbiosis Institute of Technology, Pune, Maharashtra, India

Abstract

Gene promoters are key DNA regulatory elements positioned around transcription start sites and are responsible for regulating gene transcription. Various alignment-based, signal-based and content-based approaches have been reported for the prediction of promoters. However, because not all promoter sequences show explicit features, the prediction performance of these techniques is poor. Therefore, many machine learning and deep learning models have been proposed for promoter prediction. In this work, we studied methods for vector encoding and promoter classification using genome sequences of three distinct higher eukaryotes, viz. yeast (Saccharomyces cerevisiae), plant (Arabidopsis thaliana) and human (Homo sapiens). We compared the one-hot vector encoding method with frequency-based tokenization (FBT) for data pre-processing on a 1-D convolutional neural network (CNN) model. We found that FBT gives a shorter input dimension, reducing the training time without affecting the sensitivity and specificity of classification. We employed deep learning techniques, mainly a CNN and a recurrent neural network with long short-term memory (LSTM), alongside a random forest (RF) classifier, for promoter classification at k-mer sizes of 2, 4 and 8. We found the CNN to be superior both in distinguishing promoters from non-promoter sequences (binary classification) and in species-specific classification of promoter sequences (multiclass classification). In summary, the contribution of this work lies in the use of a synthetic shuffled negative dataset and frequency-based tokenization for pre-processing. This study provides a comprehensive and generic framework for classification tasks in genomic applications and can be extended to various classification problems.
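
The pre-processing ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the overlapping (stride 1) k-mer windows, the rank-by-frequency vocabulary and the simple per-base shuffle used for negatives are all assumptions made for illustration, since the abstract does not specify these details.

```python
import random
from collections import Counter

BASES = "ACGT"

def one_hot(seq):
    """One-hot encoding: each base becomes a 4-element indicator vector."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def kmers(seq, k):
    """Split a sequence into overlapping k-mers (stride 1 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_vocab(seqs, k):
    """Assumed FBT scheme: rank k-mers by corpus frequency.

    The most frequent k-mer gets index 1; index 0 is reserved for unseen k-mers.
    Ties are broken alphabetically so the mapping is deterministic.
    """
    counts = Counter(km for s in seqs for km in kmers(s, k))
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    return {km: i + 1 for i, km in enumerate(ranked)}

def tokenize(seq, vocab, k):
    """Map a sequence to one integer token per k-mer."""
    return [vocab.get(km, 0) for km in kmers(seq, k)]

def shuffled_negative(seq, rng):
    """Hypothetical negative example: shuffle bases, preserving composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

if __name__ == "__main__":
    promoter = "ACGTAC"
    vocab = build_vocab([promoter], k=2)
    print(len(one_hot(promoter)), "x 4 one-hot values")
    print(tokenize(promoter, vocab, k=2))
    print(shuffled_negative(promoter, random.Random(0)))
```

Under these assumptions, a sequence of length L yields L × 4 one-hot values but only L − k + 1 integer tokens, which is one way a tokenized input can be shorter than its one-hot counterpart.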

Publisher

PeerJ

Subject

General Computer Science
