Abstract
Large regions of prokaryotic genomes are currently without any annotation, in part due to well-established limitations of annotation tools. For example, it is routine for annotation tools to misreport or completely omit genes that use alternative start codons. We therefore present StORF-Reporter, a tool that takes an annotated genome and returns missing CDS genes from unannotated regions. StORF-Reporter consists of two parts. The first extracts the unannotated regions from an annotated genome. The second identifies Stop-ORFs (StORFs) in these unannotated regions. StORFs are open reading frames that are delimited by stop codons and thus can capture those genes most often missing in genome annotations.

We show that this methodology recovers genes missing from canonical genome annotations. We inspected the results for the genomes of model organisms, the pangenome of Escherichia coli, and a further 6,223 prokaryotic genomes of 179 genera from the Ensembl Bacteria database. StORF-Reporter was able to extend the core, soft-core and accessory gene collections, identify novel gene families and extend families into additional genera. The high levels of sequence conservation observed between genera suggest that many of these StORF sequences are likely to be functional genes that must now be added to the canonical annotations.
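The two steps described above can be illustrated with a minimal Python sketch. It is not the StORF-Reporter implementation: the function names `unannotated_regions` and `find_storfs`, the `min_len` parameter, the 0-based half-open coordinates and the choice of the standard stop codons TAA, TAG and TGA are all assumptions made for illustration, and for brevity only the forward strand is scanned.

```python
# Illustrative sketch only; names, defaults and coordinates are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed standard stop codons


def unannotated_regions(genome_length, annotated, min_len=30):
    """Return 0-based half-open (start, end) gaps not covered by any annotation."""
    regions = []
    cursor = 0  # end of the annotated sequence seen so far
    for start, end in sorted(annotated):
        if start - cursor >= min_len:
            regions.append((cursor, start))
        cursor = max(cursor, end)  # handles overlapping annotations
    if genome_length - cursor >= min_len:
        regions.append((cursor, genome_length))
    return regions


def find_storfs(seq, min_len=30):
    """Return half-open spans running from one stop codon to the next
    in-frame stop codon (both stops included), forward strand only."""
    seq = seq.upper()
    storfs = []
    for frame in range(3):
        prev_stop = None  # position of the most recent in-frame stop
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if prev_stop is not None and i + 3 - prev_stop >= min_len:
                    storfs.append((prev_stop, i + 3))
                prev_stop = i
    return sorted(storfs)


if __name__ == "__main__":
    gaps = unannotated_regions(100, [(10, 40), (60, 90)], min_len=10)
    print(gaps)
    print(find_storfs("TAAATGAAACCCTAG", min_len=9))
```

Because a stop-to-stop span does not depend on recognising a start codon, a scan of this kind can still capture a gene whose start codon is non-canonical. A fuller treatment would also scan the reverse complement and decide how to handle overlapping candidates.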
Publisher
Cold Spring Harbor Laboratory