Classifying Coding DNA with Nucleotide Statistics

Published: 2009-01 Volume: 3 Page: BBI.S3030
ISSN: 1177-9322
Container-title: Bioinformatics and Biology Insights
Language: en
Short-container-title: Bioinform Biol Insights

Author:

Carels Nicolas¹, Frías Diego²

Affiliation:

1. Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil.

2. Universidade do Estado da Bahia (UNEB), Departamento de Ciências Exatas e da Terra, Salvador, BA, Brazil.

Abstract

In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by the Codon Structure Factor (CSF) and by a method that we called the Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification by UFM is higher than by CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of cytosine (C), guanine (G), and adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in the 1st and 2nd positions of triplets, and (v) the distance of their GC3 vs. GC2 levels from the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp), and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower than or equal to 5%. The method returns coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
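
The quantities named in items (i) to (v) of the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the scoring weights, thresholds, or the regression line of the universal correlation, so the sketch only computes the raw per-frame statistics and stops there. The function names (`codons`, `position_probs`, `frame_features`) and the dictionary keys are hypothetical, and the standard genetic code's stop codons (TAA, TAG, TGA) are assumed.

```python
from collections import Counter

# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq, frame=0):
    """Split seq into complete triplets starting at offset frame (0, 1, or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def position_probs(triplets):
    """Return a list of three dicts: frequency of A, C, G, T at codon positions 1, 2, 3."""
    n = len(triplets)
    probs = []
    for pos in range(3):
        counts = Counter(t[pos] for t in triplets)
        probs.append({base: counts[base] / n for base in "ACGT"})
    return probs


def frame_features(seq, frame=0):
    """Compute the raw statistics listed in the abstract for one reading frame.

    Combining them into a single coding/non-coding score is left out,
    because the abstract does not specify how that is done.
    """
    trip = codons(seq, frame)
    if not trip:
        raise ValueError("sequence too short for the requested frame")
    p = position_probs(trip)
    purine = [p[i]["A"] + p[i]["G"] for i in range(3)]
    return {
        "stop_count": sum(t in STOP_CODONS for t in trip),         # (i)
        "purine_product": purine[0] * purine[1] * purine[2],       # (ii)
        "c1_g2_a3_product": p[0]["C"] * p[1]["G"] * p[2]["A"],     # (iii)
        "g1": p[0]["G"],                                           # (iv)
        "g2": p[1]["G"],                                           # (iv)
        "gc2": p[1]["G"] + p[1]["C"],                              # input to (v)
        "gc3": p[2]["G"] + p[2]["C"],                              # input to (v)
    }
```

Calling `frame_features(seq, frame)` for each of the three frames on both strands would give six sets of statistics to compare; how UFM ranks them to choose the coding strand and frame is described in the paper itself, not here.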

Publisher

SAGE Publications

Subject

Applied Mathematics, Computational Mathematics, Computer Science Applications, Molecular Biology, Biochemistry

Link

http://journals.sagepub.com/doi/pdf/10.4137/BBI.S3030

Cited by 6 articles.

1. Common and phylogenetically widespread coding for peptides by bacterial small RNAs;BMC Genomics;2017-07-21

2. A Metagenomic Analysis of Bacterial Microbiota in the Digestive Tract of Triatomines;Bioinformatics and Biology Insights;2017-01-01

3. An Interpretation of the Ancestral Codon from Miller's Amino Acids and Nucleotide Correlations in Modern Coding Sequences;Bioinformatics and Biology Insights;2015-01

4. The Purine Bias of Coding Sequences is Determined by Physicochemical Constraints on Proteins;Bioinformatics and Biology Insights;2014-01

5. A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences;Bioinformatics and Biology Insights;2013-01