No one tool to rule them all: prokaryotic gene prediction tool annotations are highly dependent on the organism of study

Authors:

Dimonaco Nicholas J (1), Aubrey Wayne (2), Kenobi Kim (3), Clare Amanda (2), Creevey Christopher J (4)

Affiliations:

1. Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth SY23 3PD, UK

2. Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, UK

3. Department of Mathematics, Aberystwyth University, Aberystwyth SY23 3BZ, UK

4. School of Biological Sciences, Queen’s University Belfast, Belfast BT7 1NN, UK

Abstract

Motivation

The biases in CoDing Sequence (CDS) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information, as it results in predictions being biased towards existing knowledge. To date, users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any CDS prediction tool and to choose the right tool for their analysis.

Results

We present an evaluation framework (ORForise) based on a comprehensive set of 12 primary and 60 secondary metrics that facilitate the assessment of the performance of CDS prediction tools. This makes it possible to identify which tool performs better for specific use-cases. We use this framework to assess 15 ab initio- and model-based tools representing those most widely used (historically and currently) to generate the knowledge in genomic databases. We find that the performance of any tool is dependent on the genome being analysed, and no individual tool ranked as the most accurate across all genomes or metrics analysed. Even the top-ranked tools produced conflicting gene collections, which could not be resolved by aggregation. The ORForise evaluation framework provides users with a replicable, data-led approach to make informed tool choices for novel genome annotations and for refining historical annotations.

Availability and implementation

Code and datasets for reproduction and customisation are available at https://github.com/NickJD/ORForise.

Supplementary information

Supplementary data are available at Bioinformatics online.

Funder

Institute of Biological, Environmental and Rural Sciences Aberystwyth PhD fellowship

Biotechnology and Biological Sciences Research Council

Department of Agriculture, Food and the Marine Ireland/DAERA Northern Ireland

European Commission via Horizon 2020

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Cited by 27 articles.
