Author:
Hallee Logan, Khomtchouk Bohdan B.
Abstract
In this study, we investigate how an organism’s codon usage bias can serve as a predictor and classifier of various genomic and evolutionary traits across the domains of life. We perform secondary analysis of existing genetic datasets to build several AI/machine learning models. When trained on codon usage patterns of nearly 13,000 organisms, our models accurately predict the organelle of origin and taxonomic identity of nucleotide samples. We extend our analysis to identify the most influential codons for phylogenetic prediction with a custom feature ranking ensemble. Our results suggest that the genetic code can be utilized to train accurate classifiers of taxonomic and phylogenetic features. We then apply this classification framework to open reading frame (ORF) detection. Our statistical model assesses all possible ORFs in a nucleotide sample and rejects or deems them plausible based on the codon usage distribution. Our dataset and analyses are made publicly available on GitHub and the UCI ML Repository to facilitate open-source reproducibility and community engagement.
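As a rough illustration of the kind of feature the abstract refers to, the sketch below computes the relative frequency of each of the 64 codons in a single reading frame of a nucleotide sequence. It is not the authors' pipeline: the function name `codon_frequencies`, the choice of frame 0, and the handling of ambiguous bases are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# All 64 DNA codons in a fixed order, so every organism yields
# a feature vector with the same layout.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(seq: str) -> dict[str, float]:
    """Relative frequency of each codon in frame 0 of `seq`.

    Hypothetical helper for illustration only. RNA input is mapped
    to DNA (U -> T); codons containing ambiguous bases such as N
    are ignored; trailing bases that do not fill a codon are dropped.
    """
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[c] for c in CODONS)
    if total == 0:
        return {c: 0.0 for c in CODONS}
    return {c: counts[c] / total for c in CODONS}


# Example: four codons (ATG, GCC, GCC, TAA) in frame 0.
freqs = codon_frequencies("ATGGCCGCCTAA")
```

A 64-value vector of this kind could then be passed to any standard classifier; which models and preprocessing steps the study actually uses is not stated in the abstract.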
Funder
University of Delaware
National Institutes of Health
Publisher
Springer Science and Business Media LLC
Cited by
7 articles.