Abstract
New large-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. A new automatic tool, GeneMark-ETP, presented here, finds genes by integrating genomic-, transcriptomic-, and protein-derived evidence. The algorithm was developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data are sufficient for gene prediction with ‘high confidence’ and then proceeds to find the remaining genes across the whole genome. The initial parameters of the statistical model are estimated on a training set made from the high-confidence genes. Subsequently, the model parameters are iteratively updated in rounds of gene prediction and parameter re-estimation. Upon reaching convergence, GeneMark-ETP makes the final predictions of the whole complement of genes. As expected, GeneMark-ETP performed better than GeneMark-ET and GeneMark-EP+, gene finders that each use a single type of extrinsic evidence: either short RNA-seq reads or homologous proteins mapped to the genome. For comparisons with tools that use both transcript- and protein-derived extrinsic evidence, we chose MAKER2 and a more recent tool, TSEBRA, which combines BRAKER1 and BRAKER2. The results demonstrated that GeneMark-ETP delivered state-of-the-art gene prediction accuracy, with its margin over existing approaches increasing for larger and more complex eukaryotic genomes.
Publisher
Cold Spring Harbor Laboratory
Cited by
24 articles.