Abstract
Assembled genomes and their associated annotations have transformed our study of gene function. However, each new assembly generates new gene models. Inconsistencies between annotations likely arise from biological and technical causes, including pseudogene misclassification, transposon activity, and intron retention from sequencing of unspliced transcripts. To evaluate gene model predictions, we developed reelGene, a pipeline of machine learning models focused on (1) transcription boundaries, (2) mRNA integrity, and (3) protein structure. The first two models leverage sequence characteristics and evolutionary conservation across related taxa to learn the grammar of conserved transcription boundaries and mRNA sequences, while the third uses the conserved evolutionary grammar of protein sequences to predict whether a gene can produce a protein. Evaluating 1.8 million gene models in maize, reelGene found that 28% were incorrectly annotated or nonfunctional. By leveraging a large cohort of related species and learning the conserved grammar of proteins, reelGene provides a tool for both evaluating gene model accuracy and studying genome biology.
Publisher
Cold Spring Harbor Laboratory