GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins

Author:

Brůna Tomáš (1), Lomsadze Alexandre (2), Borodovsky Mark (1, 2, 3)

Affiliation:

1. School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA

2. Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA

3. School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA

Abstract

We have made several steps toward creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced GeneMark-ES, an automated method for efficient ab initio gene finding, with parameters trained in an iterative unsupervised mode. Next, in GeneMark-ET we proposed a method for integrating unsupervised training with information on intron positions revealed by mapping short RNA reads. Now we describe GeneMark-EP, a tool that utilizes another source of external information, a protein database, readily available prior to the start of a sequencing project. A new specialized pipeline, ProtHint, initiates massive protein mapping to the genome and extracts hints to splice sites and to translation start and stop sites of potential genes. GeneMark-EP uses the hints to improve estimation of model parameters and, in the -EP+ mode, to adjust coordinates of predicted genes if they disagree with the most reliable hints. Tests of GeneMark-EP and -EP+ demonstrated improvements in gene prediction accuracy in comparison with GeneMark-ES, while GeneMark-EP+ showed higher accuracy than GeneMark-ET. The most pronounced improvements in gene prediction accuracy were observed in large eukaryotic genomes.

Funder

National Institutes of Health

Publisher

Oxford University Press (OUP)

Subject

General Medicine

