Unique function words characterize genomic proteins

Published:2018-06-12 Issue:26 Volume:115 Page:6703-6708
ISSN:0027-8424
Container-title:Proceedings of the National Academy of Sciences
language:en
Short-container-title:Proc Natl Acad Sci USA

Author:

Scaiewicz Andrea, Levitt Michael

Abstract

Between 2009 and 2016 the number of protein sequences from known species increased 10-fold from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the conserved domain architecture retrieval tool (CDART) as a sequence motif. Motifs provide clues to biological function but CDART often matches the same region of a protein by two or more profiles. Such synonyms complicate estimates of functional complexity. We do full-linkage clustering of redundant profiles by finding maximum disjoint cliques: Each cluster is replaced by a single representative profile to give what we term a unique function word (UFW). From 2009 to 2016, the number of sequence profiles used by CDART increased by 80%; the number of UFWs increased more slowly by 30%, indicating that the number of UFWs may be saturating. The number of sequences matched by a single UFW (sequences with single domain architectures) increased as slowly as the number of different words, whereas the number of sequences matched by a combination of two or more UFWs in sequences with multiple domain architectures (MDAs) increased at the same rate as the total number of sequences. This combinatorial arrangement of a limited number of UFWs in MDAs accounts for the genomic diversity of protein sequences. Although eukaryotes and prokaryotes use very similar sets of “words” or UFWs (57% shared), the “sentences” (MDAs) are different (1.3% shared).
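The clustering step described in the abstract can be illustrated with a small sketch. The abstract does not give the procedure in detail, so everything here is an assumption made for illustration: profiles are treated as nodes of a graph, an edge means two profiles match the same protein region, and clusters are formed greedily by repeatedly removing the largest remaining clique (so every pair inside a cluster is linked, which is the "full-linkage" requirement). The function names, the toy overlap data, and the rule of taking the alphabetically first profile as the representative are all hypothetical.

```python
from typing import Dict, List, Set


def maximal_cliques(adj: Dict[str, Set[str]]) -> List[Set[str]]:
    """Enumerate all maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques: List[Set[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            cliques.append(set(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def disjoint_clique_clusters(adj: Dict[str, Set[str]]) -> List[Set[str]]:
    """Greedy assumption: repeatedly peel off the largest remaining clique."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    clusters: List[Set[str]] = []
    while remaining:
        # Largest clique first; ties broken alphabetically for determinism.
        best = min(maximal_cliques(remaining), key=lambda c: (-len(c), sorted(c)))
        clusters.append(best)
        for v in best:
            del remaining[v]
        for nbrs in remaining.values():
            nbrs -= best
    return clusters


# Hypothetical overlap graph: an edge means two profiles hit the same region.
overlaps = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
    "E": set(),
}

clusters = disjoint_clique_clusters(overlaps)
# One representative per cluster stands in for a "unique function word".
representatives = [min(cluster) for cluster in clusters]
print(len(representatives), representatives)
```

In this toy graph, profile D overlaps only C, so it is not fully linked to A and B and ends up in its own cluster. Exhaustive clique enumeration is exponential in the worst case, so this sketch only suits small inputs.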

Funder

HHS | NIH | National Institute of General Medical Sciences

Publisher

Proceedings of the National Academy of Sciences

Subject

Multidisciplinary

Cited by 10 articles.

1. Biotechnology in Medicine: Advances-II;Fundamentals and Advances in Medical Biotechnology;2022

2. Improved RAD51 binders through motif shuffling based on the modularity of BRC repeats;Proceedings of the National Academy of Sciences;2021-11-12

3. Bridging Themes: Short Protein Segments Found in Different Architectures;Molecular Biology and Evolution;2021-01-27

4. Bridging themes: short protein segments found in different architectures;2020-12-22

5. Improved RAD51 binders through motif shuffling based on the modularity of BRC repeats;2020-05-15