Abstract
Assessing the quality of protein-coding gene repertoires is critical in an era of increasingly abundant genome sequences for a diversity of species. State-of-the-art genome annotation assessment tools measure the completeness of a gene repertoire, but are blind to other types of errors, such as gene over-prediction or contamination.

We developed OMArk, a software tool relying on fast, alignment-free sequence comparisons between a query proteome and precomputed gene families across the tree of life. OMArk assesses not only the completeness, but also the consistency of the gene repertoire as a whole relative to closely related species. It also reports likely contamination events.

We validated OMArk with simulated data, then performed an analysis of the 1805 UniProt Eukaryotic Reference Proteomes, illustrating its usefulness for comparing and prioritizing proteomes based on their quality measures. In particular, we found strong evidence of contamination in 59 proteomes, and identified error propagation in avian gene annotation resulting from the use of a fragmented zebra finch proteome as reference.

OMArk is available on GitHub (https://github.com/DessimozLab/OMArk), as a Python package on PyPI, and as an interactive online tool at https://omark.omabrowser.org/.
Publisher
Cold Spring Harbor Laboratory