Abstract
The preference for synonymous codons, termed codon usage bias (CUB), is a fundamental feature of coding sequences, with distinct preferences being observed across species, genomes and genes. Accurately quantifying codon usage frequencies is useful for a range of applications, from guiding mRNA vaccine design to elucidating protein folding and uncovering co-evolutionary relationships. However, current methods are based on a single genome assembly, lack functional stratification, or are extremely outdated. To address this, we adopted a data-driven approach and developed Codon Usage Bias estimation from RNA-sequencing data (CUBSEQ), a fully automatic meta-analysis pipeline to estimate CUB at the transcriptome level and for gene panels. Here, we used CUBSEQ to perform, to our knowledge, the largest and most comprehensive CUB analysis of the transcriptome and highly expressed genes in Escherichia coli, using RNA sequencing data from 6,763 samples across 72 strains. By capturing sequence variants of these genes through variant calls, we constructed a per-sample representation of the E. coli transcriptome, revealing a rich mutational landscape. We then identified a set of 81 highly expressed genes with consistent expression patterns across strains, sample library sizes and experimental conditions, and found significant differences in CUB compared to transcriptome-wide genes and alternative codon usage tables. Finally, we found that codons with a high relative frequency were often associated with a larger repertoire of isoaccepting tRNAs and not necessarily with high tRNA abundance.
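To make the notion of a codon's relative frequency concrete, the sketch below counts codons in a single coding sequence and divides each count by the total for all codons encoding the same amino acid. This is one common definition and is only an illustration: it is not the CUBSEQ pipeline, the function name `relative_codon_frequencies` is hypothetical, and it assumes the standard genetic code and an in-frame sequence.

```python
from collections import Counter, defaultdict

# Standard genetic code, built in the conventional TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def relative_codon_frequencies(cds):
    """Return {codon: frequency among its synonymous codons} for one in-frame CDS.

    Hypothetical helper for illustration; codons containing ambiguous bases
    (anything outside A, C, G, T) are skipped.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Sum the counts of all codons that encode the same amino acid.
    totals = defaultdict(int)
    for codon, n in counts.items():
        if codon in CODON_TABLE:
            totals[CODON_TABLE[codon]] += n

    return {
        codon: n / totals[CODON_TABLE[codon]]
        for codon, n in counts.items()
        if codon in CODON_TABLE
    }


# Toy sequence (not real data): Met, Leu, Leu, Leu, Lys, stop.
freqs = relative_codon_frequencies("ATGCTGCTGCTAAAATAA")
```

In this toy sequence leucine is encoded twice by CTG and once by CTA, so CTG receives a relative frequency of 2/3 and CTA 1/3. Leucine codons that never appear are simply absent from the result rather than reported as zero.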
Publisher
Cold Spring Harbor Laboratory