An Assessment of Gene Prediction Accuracy in Large DNA Sequences

Published:2000-10-01 Issue:10 Volume:10 Page:1631-1642
ISSN:1088-9051
Container-title:Genome Research
language:en
Short-container-title:Genome Res.

Author:

Guigó Roderic,Agarwal Pankaj,Abril Josep F.,Burset Moisés,Fickett James W.

Abstract

One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the ∼200 kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX, was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments.
Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.
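
The abstract reports sensitivity and specificity without giving formulas. As a minimal sketch, assuming the convention common in the gene-prediction literature (sensitivity is the fraction of true coding nucleotides that are predicted; specificity is the fraction of predicted coding nucleotides that are truly coding), the nucleotide-level measures could be computed as follows. The function name `nucleotide_accuracy` and the interval representation are hypothetical choices for illustration, not taken from the paper.

```python
def nucleotide_accuracy(true_exons, predicted_exons):
    """Nucleotide-level sensitivity and specificity for exon predictions.

    Both arguments are lists of (start, end) intervals, 1-based and inclusive.
    Assumed convention: sensitivity = TP / (TP + FN),
    specificity = TP / (TP + FP).
    """
    def covered(intervals):
        # Expand intervals into the set of coding positions they cover.
        positions = set()
        for start, end in intervals:
            positions.update(range(start, end + 1))
        return positions

    actual = covered(true_exons)
    predicted = covered(predicted_exons)
    true_positives = len(actual & predicted)

    sensitivity = true_positives / len(actual) if actual else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Every real exon is found, but one spurious exon is predicted
# in what would be intergenic sequence.
real = [(101, 200), (301, 400)]
called = [(101, 200), (301, 400), (1001, 1200)]
print(nucleotide_accuracy(real, called))
```

In this toy case sensitivity stays at its maximum while specificity is halved, which is the kind of pattern the abstract describes for an ab initio predictor on long sequences: false positives in intergenic regions lower overall accuracy without reducing sensitivity.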

Publisher

Cold Spring Harbor Laboratory

Subject

Genetics (clinical),Genetics

References: 25 articles.

1. Abril, J.F. and Guigó, R. 2000. gff2ps: A tool for visualizing genomic annotations. Bioinformatics, in press.

2. Birney. 1997. Dynamite: A flexible code generating language for dynamic programming methods used in sequence comparison. ISMB.

3. Prediction of complete gene structures in human genomic DNA

4. Finding the genes in genomic DNA

Cited by 164 articles.

1. toGC: a pipeline to correct gene model for functional excavation of dark GPCRs in Phytophthora sojae;Journal of Integrative Agriculture;2024-03

2. Chromosome-level Dinobdella ferox genome provided a molecular model for its specific parasitism;Parasites & Vectors;2023-09-11

3. Molecular Techniques for Plants Gene Expression Analysis at the Transcriptomics Level;Journal of Crop Breeding;2023-05-01

4. Protein phosphorylation database and prediction tools;Briefings in Bioinformatics;2023-03

5. De Novo Assembly and Annotation of the Vaginal Metatranscriptome Associated with Bacterial Vaginosis;International Journal of Molecular Sciences;2022-01-30