CDBProm: the Comprehensive Directory of Bacterial Promoters

Author:

Martinez Gustavo Sganzerla (1,2,3), Perez-Rueda Ernesto (4), Kumar Anuj (1,2,3), Dutt Mansi (1,2,3), Maya Cinthia Rodríguez (5), Ledesma-Dominguez Leonardo (6), Casa Pedro Lenz (7), Kumar Aditya (8), de Avila e Silva Scheila (7), Kelvin David J (1,2,3)

Affiliation:

1. Microbiology and Immunology, Dalhousie University, Halifax, Nova Scotia B3H 4H7, Canada

2. Pediatrics, Izaak Walton Killam (IWK) Health Center, Canadian Center for Vaccinology (CCfV), Halifax, Nova Scotia B3H 4H7, Canada

3. BioForge Canada Limited, Halifax, Nova Scotia B3N 3B9, Canada

4. Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Unidad Académica del Estado de Yucatán, Mérida 97302, Yucatán, Mexico

5. Facultad de Ciencias e Ingeniería, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico

6. Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico

7. Biotechnology Institute, Universidade de Caxias do Sul, Caxias do Sul, Rio Grande do Sul 95070-560, Brazil

8. Molecular Biology and Biotechnology, Tezpur University, Tezpur, Assam 784028, India

Abstract

The decreasing cost of whole genome sequencing has produced high volumes of genomic information that require annotation. The experimental identification of promoter sequences, which are pivotal for regulating gene expression, is laborious and cost-prohibitive. To expedite this task, we introduce the Comprehensive Directory of Bacterial Promoters (CDBProm), a directory of in-silico predicted bacterial promoter sequences. We first found that an Extreme Gradient Boosting (XGBoost) classifier distinguishes promoters from random downstream regions with an accuracy of 87%. To capture distinctive promoter signals, we then trained a second XGBoost classifier on the instances misclassified by the first. The CDBProm predictor was applied to over 55 million upstream regions from more than 6000 bacterial genomes. Each promoter predicted in an upstream region is mapped to the genomic data of the organism, linking it with its coding DNA sequence and identifying the function of the gene it regulates. The resulting collection enables quantitative analysis of bacterial promoters at scale. Our collection of over 24 million promoters is publicly available at https://aw.iimas.unam.mx/cdbprom/
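The mapping step described in the abstract (take the region upstream of each coding sequence, score it, and link any hit back to the gene and its function) can be illustrated with a minimal Python sketch. This is not CDBProm's code. The 80 bp window, the 0-based half-open coordinates, the `Cds` record, and the `is_promoter` callable (a stand-in for the trained XGBoost predictor) are all assumptions made for illustration, and the sketch treats the genome as linear rather than circular.

```python
from dataclasses import dataclass
from typing import Callable

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class Cds:
    """Hypothetical annotation record for one coding sequence."""
    gene_id: str
    start: int    # 0-based, inclusive, on the forward strand
    end: int      # 0-based, exclusive, on the forward strand
    strand: str   # "+" or "-"
    product: str  # annotated gene function


def upstream_region(genome: str, cds: Cds, length: int = 80) -> str:
    """Return up to `length` bases upstream of the CDS, read 5'->3'
    on the coding strand. The window is truncated at sequence ends."""
    if cds.strand == "+":
        lo = max(0, cds.start - length)
        return genome[lo:cds.start]
    if cds.strand == "-":
        hi = min(len(genome), cds.end + length)
        return reverse_complement(genome[cds.end:hi])
    raise ValueError(f"unknown strand: {cds.strand!r}")


def link_promoters(
    genome: str,
    cds_list: list[Cds],
    is_promoter: Callable[[str], bool],
    length: int = 80,
) -> list[dict]:
    """Score each upstream region and link predicted promoters
    to the gene they precede."""
    hits = []
    for cds in cds_list:
        region = upstream_region(genome, cds, length)
        if region and is_promoter(region):
            hits.append({
                "gene_id": cds.gene_id,
                "product": cds.product,
                "strand": cds.strand,
                "promoter": region,
            })
    return hits
```

In use, `is_promoter` would wrap whatever classifier is available. Passing a toy rule such as `lambda s: "TATAAT" in s` is enough to exercise the linking logic on a small test sequence.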

Funder

Canadian Institutes of Health Research

Mpox Rapid Research

Research Nova Scotia

Dalhousie Medical Research Foundation

Li Ka Shing Foundation

Consejo Nacional de Humanidades, Ciencias y Tecnologías

Publisher

Oxford University Press (OUP)
