Abstract
Protein structure prediction has now been deployed widely across several large protein sets. Large-scale domain annotation of these predictions can aid in the development of biological insights. Using our Evolutionary Classification of Protein Domains (ECOD) from experimental structures as a basis for classification, we describe the detection and cataloging of domains from 48 whole proteomes deposited in the AlphaFold Database. On average, we can provide positive classification (either of domains or other identifiable non-domain regions) for 90% of residues in all proteomes. We classified 746,349 domains from 536,808 proteins comprising over 226,424,000 amino acid residues. We examine the varying populations of homologous groups in both eukaryotes and bacteria. In addition to containing a higher fraction of disordered regions and unassigned domains, eukaryotes show a higher proportion of repeat proteins, including both globular and small repeats. We enumerate the highly populated domains that are shared between eukaryotes and bacteria, such as the Rossmann domains, TIM barrels, and P-loop domains. Additionally, we compare the sampling of homologous groups from this whole proteome set against our stable ECOD reference and discuss groups that have been enriched by structure predictions. Finally, we discuss the implications of these results for protein target selection in future classification strategies for very large protein sets.
Author Summary
Proteins can contain one or more domains, regions that are evolutionarily independent and convey function. Here we present our classification of proteins within 48 proteomes provided by the AlphaFold Structural Database. These proteomes span multiple model organisms used in research as a common ground for studying biological principles, as well as organisms involved in prevalent human infectious diseases.
We classify these domains using DPAM, our Domain Parser for AlphaFold Models, which was previously tested on the human proteome. We find that eukaryotic and bacterial proteomes can be classified to different degrees, with significantly more disordered and low-confidence regions in eukaryotic proteins.
Publisher
Cold Spring Harbor Laboratory