Abstract
Ankyrin-containing proteins are one of the most abundant repeat protein families present in all extant organisms. They are made of tandem copies of similar amino acid stretches that fold into elongated architectures. Here, we built and curated a dataset of 200 thousand proteins that contain 1.2 million Ankyrin regions and characterized the abundance, structure and energetics of the repetitive regions in natural proteins. We found a continuous, roughly exponential distribution of array lengths, with an exceptional frequency at 24 repeats. We describe that individual repeats are seldom interrupted by long insertions and accept few deletions, consistent with the known tertiary structures. We found that longer arrays are made up of repeats that are more similar to each other than those in shorter arrays, and display more favourable folding energy, hinting at their evolutionary origin. The array distributions show that there is a physical upper limit to the size of an array of Ankyrin repeats of about 120 copies, consistent with the limit found in nature. Analysis of the identity patterns within the arrays suggests that they may have originated by sequential copies of more than one Ankyrin unit.
Author summary
Repeat proteins are coded as tandem copies of similar amino acid stretches. We built and curated a large dataset of Ankyrin-containing proteins, one of the most abundant families of repeat proteins, and characterized the structure of the arrays formed by the repetitions. We found that large arrays are constructed with repetitions that are more similar to each other than those in shorter arrays. Also, the larger the array, the more energetically favourable its folding is. We speculate about the mechanistic origin of large arrays and hint at their evolutionary dynamics.
Publisher
Cold Spring Harbor Laboratory