Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes

Published: 2020-11-10 Issue: 1 Volume: 21
ISSN: 1471-2105
Container-title: BMC Bioinformatics
Language: en

Author:

Meyer Corentin, Scalzitti Nicolas, Jeannin-Girardon Anne, Collet Pierre, Poch Olivier, Thompson Julie D.

Abstract

Background: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses.

Results: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5,446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins.

Conclusions: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.
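The three error classes named in the abstract (deletions, insertions and mismatched segments relative to the human reference) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `classify_column` and `find_error_segments`, the `min_len` threshold and the toy sequences are all assumptions made for illustration, and the sketch assumes the two sequences have already been aligned with `-` as the gap character.

```python
from itertools import groupby


def classify_column(ref_char: str, query_char: str) -> str:
    """Label one alignment column relative to the reference (hypothetical helper)."""
    if query_char == "-" and ref_char != "-":
        return "deletion"    # residue present in reference, missing in query
    if ref_char == "-" and query_char != "-":
        return "insertion"   # residue present in query, absent from reference
    if ref_char != query_char:
        return "mismatch"    # both present but different
    return "match"


def find_error_segments(ref_aln: str, query_aln: str, min_len: int = 3):
    """Return (label, start, end) runs of non-matching columns.

    Coordinates are 0-based, end-exclusive alignment columns.
    min_len is an assumed threshold, used here to ignore short runs.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    labels = [classify_column(r, q) for r, q in zip(ref_aln, query_aln)]
    segments = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label != "match" and length >= min_len:
            segments.append((label, pos, pos + length))
        pos += length
    return segments


# Toy aligned pair (invented sequences, not taken from the study).
reference = "MKTAYIAKQR--QISFVKSHFSRQ"
predicted = "MKTA---KQRGGQISWWWWHFSRQ"
print(find_error_segments(reference, predicted, min_len=2))
```

With `min_len=2` the toy pair yields one segment of each class. Raising the threshold drops the two-residue insertion, which shows how the reported counts in any such analysis depend on how short differences are treated.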

Funder

Institut Français de Bioinformatique

Agence Nationale de la Recherche

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Link

http://link.springer.com/content/pdf/10.1186/s12859-020-03855-1.pdf

References: 39 articles.

1. Mudge JM, Harrow J. The state of play in higher eukaryote gene annotation. Nat Rev Genet. 2016;17:758–72.

2. Danchin A, Ouzounis C, Tokuyasu T, Zucker J-D. No wisdom in the crowd: genome annotation in the era of big data-current status and future prospects. Microb Biotechnol. 2018;11:588–605.

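3. Pertea M, Shumate A, Pertea G, Varabyou A, Breitwieser FP, Chang YC, Madugundu AK, Pandey A, Salzberg SL. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol. 2018;19:208.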

4. Alliance of Genome Resources Consortium. The alliance of genome resources: building a modern data ecosystem for model organism databases. Genetics. 2019;213:1189–96.

5. Zahn-Zabal M, Michel PA, Gateau A, et al. The neXtProt knowledgebase in 2020: data, tools and usability improvements. Nucleic Acids Res. 2020;48(D1):D328–34.

Cited by 22 articles.

1. The nature and distribution of putative non-functional alleles suggest only two independent events at the origins of Astyanax mexicanus cavefish populations;BMC Ecology and Evolution;2024-04-01

2. toGC: a pipeline to correct gene model for functional excavation of dark GPCRs in Phytophthora sojae;Journal of Integrative Agriculture;2024-03

3. In-silico characterization of Δ4 and Δ5 desaturases in Symbiodinium microadriaticum and Perkinsus marinus, symbiont and parasitic organisms’ similarities;Marine Biology;2023-12-14

4. Deep proteome coverage advances knowledge of Treponema pallidum protein expression profiles during infection;Scientific Reports;2023-10-25

5. The genome of the toxic invasive species Heracleum sosnowskyi carries an increased number of genes despite absence of recent whole‐genome duplications;The Plant Journal;2023-10-17