Grammar of protein domain architectures

Published: 2019-02-07
Issue: 9
Volume: 116
Page: 3636-3645
ISSN: 0027-8424
Container-title: Proceedings of the National Academy of Sciences
Language: en
Short-container-title: Proc Natl Acad Sci USA

Author:

Yu Lijia, Tanwar Deepak Kumar, Penha Emanuel Diego S., Wolf Yuri I., Koonin Eugene V., Basu Malay Kumar

Abstract

From an abstract, informational perspective, protein domains appear analogous to words in natural languages in which the rules of word association are dictated by linguistic rules, or grammar. Such rules exist for protein domains as well, because only a small fraction of all possible domain combinations is viable in evolution. We employ a popular linguistic technique, n-gram analysis, to probe the “proteome grammar”—that is, the rules of association of domains that generate various domain architectures of proteins. Comparison of the complexity measures of “protein languages” in major branches of life shows that the relative entropy difference (information gain) between the observed domain architectures and random domain combinations is highly conserved in evolution and is close to being a universal constant, at ∼1.2 bits. Substantial deviations from this constant are observed in only two major groups of organisms: a subset of Archaea that appears to be cells simplified to the limit, and animals that display extreme complexity. We also identify the n-grams that represent signatures of the major branches of cellular life. The results of this analysis bolster the analogy between genomes and natural language and show that a “quasi-universal grammar” underlies the evolution of domain architectures in all divisions of cellular life. The nearly universal value of information gain by the domain architectures could reflect the minimum complexity of signal processing that is required to maintain a functioning cell.
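To make the two ideas in the abstract concrete, the following is a minimal Python sketch of n-gram extraction from domain architectures and of a relative entropy (information gain) in bits. It is an illustration under stated assumptions, not the authors' pipeline: the function names `ngrams` and `bigram_information_gain` are hypothetical, the toy architectures are invented, and the "random" baseline is assumed here to be independent pairing of domains according to their observed positional frequencies, which may differ from the null model used in the paper.

```python
from collections import Counter
from math import log2


def ngrams(architecture, n=2):
    """Return the contiguous n-grams (as tuples) of one domain architecture.

    An architecture is an ordered list of domain names, N- to C-terminus.
    """
    return [tuple(architecture[i:i + n])
            for i in range(len(architecture) - n + 1)]


def bigram_information_gain(architectures):
    """Relative entropy (bits) between observed and independent bigrams.

    Assumption: the random baseline pairs the first and second domain of a
    bigram independently, using their observed marginal frequencies.
    """
    bigram_counts = Counter()
    for arch in architectures:
        bigram_counts.update(ngrams(arch, 2))

    total = sum(bigram_counts.values())
    if total == 0:
        return 0.0  # no bigrams, e.g. only single-domain proteins

    # Marginal counts of domains in the first and second bigram position.
    first, second = Counter(), Counter()
    for (a, b), count in bigram_counts.items():
        first[a] += count
        second[b] += count

    gain = 0.0
    for (a, b), count in bigram_counts.items():
        p_observed = count / total
        p_random = (first[a] / total) * (second[b] / total)
        gain += p_observed * log2(p_observed / p_random)
    return gain


# Toy data (invented): domain order is fully constrained in the first set
# and fully unconstrained in the second.
constrained = [["A", "B"], ["A", "B"], ["C", "D"], ["C", "D"]]
unconstrained = [["A", "B"], ["A", "D"], ["C", "B"], ["C", "D"]]

print(bigram_information_gain(constrained))
print(bigram_information_gain(unconstrained))
```

In this sketch a value of zero means that domain pairs occur exactly as often as independent pairing predicts, and larger values mean that the observed architectures are more constrained than the baseline.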

Publisher

Proceedings of the National Academy of Sciences

Subject

Multidisciplinary

References: 82 articles.

1. The language of genes

2. The language of the protein universe

3. Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics

4. Unique function words characterize genomic proteins

5. Ruhlen M (1994) The Origin of Language: Tracing the Evolution of the Mother Tongue (Wiley, New York).

Cited by 51 articles.

1. Range-limited Heaps’ law for functional DNA words in the human genome;Journal of Theoretical Biology;2024-09

2. Clustering protein functional families at large scale with hierarchical approaches;Protein Science;2024-08-15

3. Evolutionary tinkering enriches the hierarchical and nested structures in amino acid sequences;Physical Review Research;2024-05-28

4. In Silico and In Vitro Evaluation of the Molecular Mimicry of the SARS-CoV-2 Spike Protein by Common Short Constituent Sequences (cSCSs) in the Human Proteome: Toward Safer Epitope Design for Vaccine Development;Vaccines;2024-05-14

5. Effect of tokenization on transformers for biological sequences;Bioinformatics;2024-03-29