Abstract
To explore the concept of a minimal gene set, we clustered 8.76 M protein sequences deduced from 2,307 completely sequenced Proteobacterial genomes. To our knowledge, this is the first study of this scale. Clustering resulted in 707,311 clusters, of which 224,442 ranged in size from 2 to 2,894 sequences. The resulting clusters allowed us to ask the question: Is a set of proteins conserved across all Proteobacteria? We chose four essential proteins, the chaperonin GroEL, DNA-dependent RNA polymerase subunits beta and beta’ (RpoB/RpoB’), and DNA polymerase I (PolA), representing fundamental cellular functions, and examined their distribution in the clusters. We found these proteins to be remarkably conserved. Although the groEL gene was universally conserved in all the organisms in the study, the protein was not represented in all the deduced proteomes. The genes for RpoB and RpoB’ were missing from two genomes and merged in 88 genomes, and the sequences were sufficiently divergent that they formed separate clusters for 18 RpoB proteins (seven clusters) and 14 RpoB’ proteins (three clusters). For PolA, 52 organisms lacked an identifiable sequence, and seven sequences were sufficiently divergent that they formed five separate clusters. Interestingly, organisms lacking an identifiable PolA and those with divergent RpoB/RpoB’ were almost all endosymbionts. Furthermore, we present a range of examples of annotation issues that caused the deduced proteins to be incorrectly represented in the proteome. These annotation issues represent a significant obstacle for high-throughput analyses.
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.