Author:
Hayes William S., Borodovsky Mark
Abstract
In this report we address the problem of accurate statistical modeling of DNA sequences, either coding or noncoding, for a bacterial species whose genome (or a large portion of it) has been sequenced but not yet characterized experimentally. The availability of such models is critical for successfully solving the genome annotation task by statistical methods of gene finding. We present a method, GeneMark-Genesis, which learns the parameters of Markov models of protein-coding and noncoding regions from anonymous bacterial genomic sequence. These models are subsequently used in the GeneMark and GeneMark.hmm gene-finding programs. Although there is essentially one model of a noncoding region for a given genome, GeneMark-Genesis automatically obtains several models of protein-coding regions. The diversity of protein-coding models reflects the diversity of oligonucleotide compositions, particularly the diversity of codon usage strategies observed in genes from one and the same genome. In the simplest and most important case, there are just two gene models: typical and atypical. We show that the atypical model allows one to predict genes that escape identification by the typical model. Many genes predicted by the atypical model appear to be horizontally transferred genes. Early versions of GeneMark-Genesis were used for annotating the genomes of Methanococcus jannaschii and Helicobacter pylori. We report the results of accuracy testing of the full-scale version of GeneMark-Genesis on 10 completely sequenced bacterial genomes. Interestingly, the GeneMark.hmm program that employed the typical and atypical models defined by GeneMark-Genesis was able to predict 683 new atypical genes, 176 of which were confirmed by similarity search.
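As a rough illustration of the idea of scoring a sequence against several Markov models, the sketch below estimates a fixed-order Markov chain from example sequences and picks the model with the highest log-likelihood. It is not the GeneMark-Genesis procedure: it assumes a homogeneous chain (ignoring codon-position periodicity), add-one smoothing, and toy training data, and the names `train_markov`, `log_likelihood`, and `best_model` are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) with additive smoothing.

    Assumption: a homogeneous Markov chain; real coding-region models
    typically also depend on the position within the codon.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        row = counts.get(context, {})  # unseen context -> uniform after smoothing
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, order=2):
    """Sum of log transition probabilities along the sequence."""
    return sum(
        math.log(prob(seq[i - order:i], seq[i])) for i in range(order, len(seq))
    )


def best_model(seq, models, order=2):
    """Return the name of the model under which `seq` is most likely."""
    return max(models, key=lambda name: log_likelihood(seq, models[name], order))


# Toy training data (hypothetical, not real genes).
models = {
    "typical": train_markov(["GCGCGGCCGCGCGGCGCC" * 3]),
    "atypical": train_markov(["ATATTAATATTTAAATAT" * 3]),
}
label = best_model("GCGGCGCC", models)
```

In this toy setup a query is assigned to whichever model explains its oligonucleotide composition better, which loosely mirrors how an atypical model can capture genes that score poorly under the typical one.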
Publisher
Cold Spring Harbor Laboratory
Subject
Genetics (clinical), Genetics
Cited by
101 articles.