Abstract
Approaches based on molecular evolution have organized natural proteins into a hierarchy of families, superfamilies, and folds, which are often pictured as islands in a great sea of unrealized and generally non-functional polypeptides. In contrast, approaches based on information theory have substantiated a mostly random scatter of natural proteins in global sequence space. We evaluate these opposing views by analyzing fragments of a given length derived from either a natural dataset or different random models. For this, we compile distances in sequence space between fragments within each dataset and compare the resulting distance distributions between sets. Even for 100-mers, more than 95% of distances can be accounted for by a random sequence model that incorporates the natural amino acid frequency of proteins. When further accounting for the specific residue composition of the respective fragments, which would include biophysical constraints of protein folding, more than 99% of all distances can be modeled. Thus, while the local space surrounding a protein is almost entirely shaped by common descent, the global distribution of proteins in sequence space is close to random, only constrained by divergent evolution through the requirement that all intermediates connecting two forms in evolution must be functional.
Significance Statement
When generating new proteins by evolution or design, can the entire sequence space be used, or do viable sequences mainly occur only in some areas of this space? As a result of divergent evolution, natural proteins mostly form families that occupy local areas of sequence space, suggesting the latter. Theoretical work however indicates that these local areas are highly diffuse and do not dramatically affect the statistics of sequence distribution, such that natural proteins can be considered to effectively cover global space randomly, though extremely sparsely. By comparing the distance distribution of natural sequences to that of various random models, we find that they are indeed distributed largely randomly, provided that the amino acid composition of natural proteins is respected.
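The comparison described in the abstract (pairwise distances within a fragment set, then a comparison of distance distributions between the natural set and random models) can be illustrated with a small sketch. The abstract does not say which distance measure or which comparison statistic was used, so the Hamming distance and the histogram-overlap score below are assumptions chosen for simplicity, and all function names are hypothetical. The toy sequences are placeholders, not data from the paper.

```python
import random
from collections import Counter
from itertools import combinations


def fragments(seq, k):
    """Return all overlapping fragments of length k from one sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of mismatching positions (assumed distance measure)."""
    return sum(x != y for x, y in zip(a, b))


def distance_histogram(frags):
    """Normalized histogram of all pairwise distances within one set."""
    counts = Counter(hamming(a, b) for a, b in combinations(frags, 2))
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}


def composition_model(frags, rng):
    """Random fragments drawn i.i.d. from the pooled residue frequencies."""
    pool = "".join(frags)
    k = len(frags[0])
    return ["".join(rng.choice(pool) for _ in range(k)) for _ in frags]


def shuffled_model(frags, rng):
    """Shuffle each fragment, keeping that fragment's own composition."""
    return ["".join(rng.sample(f, len(f))) for f in frags]


def overlap(h1, h2):
    """Shared probability mass of two histograms (assumed statistic, 0..1)."""
    return sum(min(h1.get(d, 0.0), h2.get(d, 0.0)) for d in set(h1) | set(h2))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy placeholder sequences; a real analysis would use a protein dataset.
    toy = ["MKTAYIAKQRQISFVKSHFSRQ", "MKLVYIAKERQLSFVRSHWSKQ"]
    natural = [f for s in toy for f in fragments(s, 10)]
    h_nat = distance_histogram(natural)
    for name, model in (("composition", composition_model),
                        ("shuffled", shuffled_model)):
        h_rand = distance_histogram(model(natural, rng))
        print(name, round(overlap(h_nat, h_rand), 3))
```

The two random models mirror the two levels named in the abstract: one respects only the overall amino acid frequency, the other also preserves the residue composition of each individual fragment. An overlap near 1 would mean the random model reproduces nearly all of the observed distance distribution.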
Publisher
Cold Spring Harbor Laboratory
Cited by
3 articles.