Tandem and cryptic amino acid repeats accumulate in disordered regions of proteins
© Simon and Hancock; licensee BioMed Central Ltd. 2009
Received: 19 March 2009
Accepted: 1 June 2009
Published: 1 June 2009
Amino acid repeats (AARs) are common features of protein sequences. They often evolve rapidly and are involved in a number of human diseases. They also show significant associations with particular Gene Ontology (GO) functional categories, particularly transcription, suggesting they play some role in protein function. It has been suggested recently that AARs play a significant role in the evolution of intrinsically unstructured regions (IURs) of proteins. We investigate the relationship between the frequency and evolution of AARs and their localization within proteins, based on a set of 5,815 orthologous proteins from four mammalian genomes (human, chimpanzee, mouse and rat) and one bird genome (chicken). We consider two classes of AAR: tandem repeats and cryptic repeats (regions of proteins containing overrepresentations of short amino acid repeats).
Mammals show very similar repeat frequencies but chicken shows lower frequencies of many of the cryptic repeats common in mammals. Regions flanking tandem AARs evolve more rapidly than the rest of the protein containing the repeat and this phenomenon is more pronounced for non-conserved repeats than for conserved ones. GO associations are similar to those previously described for the mammals, but chicken cryptic repeats show fewer significant associations. Comparing the overlaps of AARs with IURs and protein domains showed that up to 96% of some AAR types are associated preferentially with IURs. However, no more than 15% of IURs contained an AAR.
Their location within IURs explains many of the evolutionary properties of AARs. Further study is needed on the types of IURs containing AARs.
Amino acid repeats (AARs) are segments of proteins made up of simple patterns of amino acids, often strings of a single amino acid. They have long been recognized to be common features of eukaryotic proteins [1–4]. Polyglutamine repeats, the most intensively studied class because of their association with human diseases such as Huntington's , tend to be evolutionarily labile, especially when encoded by pure repeats of the codon CAG [6, 7]. Because of this lability, AARs have often been considered to be evolutionarily neutral structures . However, a number of experimental studies [9–12] suggest that AARs play an important role in protein function. Studies of the functions of AAR-containing proteins also suggest that they are preferentially found within certain classes of proteins. From the earliest reports through to the most recent genome-wide surveys in Saccharomyces cerevisiae [3, 13, 14] and mammals  a consistent pattern of association with transcription has emerged for the most common tandem repeat types. Additional associations, notably with protein kinases , suggest possible involvement in cellular signaling networks, which in turn suggest that repeats could play a significant role in the evolution of such networks . Finally, studies of the relationship between morphology and repeat length in dog breeds  have shown that variation at repeat loci can have evolutionarily significant effects on phenotype. Polyalanine repeats have also been found to be involved in a number of genetic diseases, in this case involving developmental defects . Removing a polyalanine tract from murine Hoxd-13 has a direct effect on bone phenotype , again indicating involvement of an AAR in an important biological process.
AAR size difference between orthologous human and mouse proteins correlates with protein nonsynonymous substitution rate . A study of the factors contributing to the evolutionary expansion of polyglutamine repeats in a limited number of human-mouse orthologues  concluded that labile repeats, which are encoded by homogeneous runs of a single codon , have a strong tendency to arise in regions of proteins subject to weaker purifying selection than the protein as a whole, while repeats that are more conserved did not show this tendency. This has been supported recently by a large-scale study of human, mouse and rat repeats . These observations suggest a model for repeat evolution whereby initially labile repeats become fixed when they reach some optimal length range . Human polyglutamine disease genes might then be still evolving towards such an optimum.
Intrinsically unstructured regions (IURs), also called disordered regions, are regions of protein, ranging in size from short loops to complete proteins, that do not form a compact tertiary structure under normal solvation conditions . They have been suggested to be involved in protein-ligand binding, including protein-protein interactions, forming compact structures only when bound to a cognate ligand . Tompa  pointed out that many IURs contain AARs and suggested that IURs may evolve to a considerable extent by the expansion of such repeats. Disordered proteins - that is, proteins primarily made up of IURs - have also been suggested to have lower sequence complexity than ordered proteins . Tompa's suggestion  would be consistent with the relatively rapid sequence evolution of many IURs [27, 28], the observation that highly connected (hub) proteins in protein interaction networks appear to be enriched in AARs and in proteins containing IURs , and the suggestion that evolution of AARs could have an effect on network evolution by altering protein-protein affinities . As Tompa  analyzed only a relatively small set of IURs, his hypothesis raises the question whether AARs show a preferential location in IURs, and whether any such preference could account for the evolutionary properties of the bulk of AARs in a proteome. Such a preference would be consistent with hypotheses on the causation of triplet expansion diseases that invoke destabilization of protein structure as an important causative factor .
A variety of computational methods exist to detect repeated sequences in proteins. These range from SEG, which looks for regions of low complexity , to alignment-based approaches . Here we use an extended definition of amino acid repetition that includes cryptic repeats as measured by the program SIMPLE, which we have previously used to look at AARs in the yeast proteome , as well as tandem AARs. This allows us to study repeats below the normal threshold taken for tandem repeats (five amino acids) and regions with significant biases in amino acid content that are not tandem in nature but may have originated from tandem repeats (C4 repeats; see Materials and methods for more detail).
Using a set of orthologues to human genes from four species (chimpanzee, mouse, rat and chicken; Pan troglodytes, Mus musculus, Rattus norvegicus and Gallus gallus) we show that the most common AARs show strong preferences to be located within IURs in all five proteomes. We also confirm that sequences flanking AARs evolve more rapidly than the remainder of their respective proteins. We conclude that the forces shaping the evolution of IURs and AARs are strongly linked, although AARs are present in only a subset of IURs.
Comparing the frequencies of homogeneous C4 repeat types with their tandem equivalents showed significant correlations (P < 0.01 or less after Bonferroni correction) ranging from 0.555 (chicken) to 0.718 (rat). Despite this broad similarity it was noteworthy that L4 repeats were absent amongst C4 repeats, although relatively common among tandem repeats.
The frequency distributions of the tandem repeat types are highly similar between the four mammals, with correlation coefficients > 0.99 (P << 0.001) for all six pairwise comparisons. The distribution for chicken correlates less well with those seen in mammals, showing correlation coefficients ranging from 0.894 (human-chicken) to 0.929 (rat-chicken). In general, chicken proteins contained fewer tandem repeats than mammalian proteins (961 in total, compared to 1,940, 1,792, 1,723 and 1,703 for human, chimpanzee, mouse and rat, respectively). Serine tandem repeats were less extreme in this respect, chicken proteins containing 193 repeats compared to 241, 230, 219 and 215 for the mammals.
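The contrast between total tandem repeats and serine repeats can be made explicit by expressing the chicken counts as a fraction of each mammalian count. The short sketch below uses only the counts quoted above; the variable and function names are illustrative and are not part of the original analysis.

```python
# Tandem repeat counts quoted in the text for the orthologue set.
totals = {"human": 1940, "chimpanzee": 1792, "mouse": 1723, "rat": 1703, "chicken": 961}
serine = {"human": 241, "chimpanzee": 230, "mouse": 219, "rat": 215, "chicken": 193}


def chicken_ratio(counts):
    """Return the chicken count divided by each mammalian count."""
    return {sp: counts["chicken"] / n for sp, n in counts.items() if sp != "chicken"}


all_ratio = chicken_ratio(totals)
ser_ratio = chicken_ratio(serine)

for species in all_ratio:
    print(f"{species}: all tandem repeats {all_ratio[species]:.2f}, "
          f"serine repeats {ser_ratio[species]:.2f}")
```

On these figures chicken carries roughly half as many tandem repeats as each mammal overall, but about 80 to 90% as many serine repeats, which is the sense in which serine repeats are "less extreme".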
We also calculated inter-species correlation coefficients between the frequencies of the commonest homogeneous C4 repeats. These C4 repeats also showed strong and significant (P << 0.001) correlations between frequencies in all five species, ranging from 0.870 for chimpanzee-rat to 0.989 for human-chimpanzee. C4 repeats were rarer in chicken proteins than mammalian proteins, glycine (G4) and glutamine (Q4) C4 repeats being particularly underrepresented in chicken.
It has been suggested that regions surrounding tandem repeats are under weaker purifying selection than the remainder of the protein they are embedded in [7, 21, 22]. Recent evidence also suggests that repeat-containing proteins evolve more rapidly than non-repeat-containing proteins . IURs, on average, also show more rapid evolution than the average protein . To confirm that repeats are located in regions under relatively weak purifying selection we measured pairwise protein sequence distances between orthologues. Proteins were subdivided into those with conserved repeats (that is, present in both species) and non-conserved repeats (present in only one), as previous analyses suggested that only non-conserved repeats lie in regions of lower purifying selection .
Mean divergences of repeat flanks versus protein remainder
Regression results of repeat flank divergence on protein remainder divergence
Functional (Gene Ontology term) association
A number of authors have discussed associations of tandem and cryptic AARs with transcription factors and protein kinases in particular [1, 3, 13–15, 34–36]. Here we consider the Gene Ontology (GO) term associations of repeat-containing members of our orthologue set in comparison with the rest of the set. We looked for significant associations (P < 0.05 after adjustment for false discovery rate) at levels 3 and 4 of the GO molecular function hierarchy. We carried out the analyses for human and chicken to characterize any differences reflected in the different repeat frequencies seen in the chicken and mammal proteomes.
C4 repeats showed fewer common associations between the human and chicken protein sets. The only shared association was found for P4 repeats with RNA binding (level 3: nucleic acid binding). In humans, Q4 repeats showed qualitatively similar associations to those seen for tandem Q repeats. E4 repeats also showed an association with cytoskeleton protein binding in chicken, which is to some extent similar to the cytoplasmic roles identified for tandem E repeats.
Domain and intrinsically unstructured region associations
These proportions represent a lower bound on the proportion of repeats lying within structured regions of proteins because structures have not been determined for all domains. An approximate upper bound can be estimated by considering the proportion lying within domains identified by InterProScan searches (excluding PANTHER; see Materials and methods). Many of these represent regions of proteins with functional associations but no known structure. Between 25% (for Q) and 95% (L) of tandem repeats lay within domains identified by InterProScan. Slightly lower proportions, between 0% (A4) and 40% (E4) of common homogeneous C4 repeats also lay within identifiable domains.
Identifiable domains most frequently associated with tandem amino acid repeat types (columns: domain, number of hits, % of repeats). Domains listed: Protein kinase-like (PK-like); Protein kinase-like (PK-like); Quinoprotein alcohol dehydrogenase-like.
Identifiable domains most frequently associated with cryptic amino acid repeat types (columns: domain, number of hits, % of repeats). Domains listed: RmlC-like cupin; Protein kinase-like (PK-like); MYT1 (myelin transcription factor-like); Protein kinase-like (PK-like).
We then considered the locations of tandem and C4 repeats compared to those of IURs. We predicted IURs using the RONN (Regional Order Neural Network) algorithm , which we selected because of its good performance and code accessibility, and because it does not explicitly include information on the chemical properties of individual amino acids (although it may do so implicitly). We preferred such a predictor because we were investigating the propensity of particular chemical entities to lie within IURs, so a predictor that uses chemical properties directly would introduce circularity into the analysis.
Most repeats showed a strong tendency to lie in unstructured regions; for tandem repeats the proportions lying within unstructured regions ranged from 96% for E and S to 67% for A, compared to 22% for the average amino acid within a protein. The exceptions were L repeats, which were predicted to be predominantly ordered. Among C4 repeats, all the common repeat types again showed a strong preference for highly disordered regions. As for tandem repeats, E4 repeats showed the highest level of disorder while A4 showed a higher degree of order. Corresponding tandem and C4 repeats showed similar distributions between ordered and disordered regions. The exceptions to this trend were Gln repeats, which showed a higher tendency to be within structured regions as C4 repeats (32%) than as tandem repeats (13%).
Finally, we considered the proportion of IUR regions that contain an AAR. These proportions differ depending on the minimum length permitted for an IUR. For a minimum IUR length of 10, on average 85% of proteins contained a predicted IUR. Twenty to 21% of mammalian proteins and 13% of chicken proteins contained some kind of tandem AAR and 12% of mammalian proteins and 9% of chicken proteins contained some kind of C4 repeat; 4.6% of IURs contained a tandem AAR and 0.5% a C4 AAR. The proportion of proteins containing an IUR reported here is higher than the generally accepted proportion of around 40% [40, 41]. We therefore investigated whether a longer length cut-off for our definition of an IUR would significantly affect these proportions. At a cut-off of 50 residues, 34% of proteins contain an IUR, which is similar to the proportion reported previously. Under this definition, 13% of IURs contained a tandem AAR and 2% a C4 AAR.
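The dependence of these proportions on the minimum IUR length can be illustrated with a small sketch. It assumes that a disorder predictor returns one probability per residue and that an IUR is a contiguous run of residues scoring at or above a threshold of 0.5 for at least `min_len` residues; the threshold, the function names and the toy score profile are assumptions made for illustration and are not taken from the analysis itself.

```python
def disordered_segments(scores, threshold=0.5, min_len=10):
    """Return half-open (start, end) runs of residues with score >= threshold
    that are at least min_len residues long."""
    segments, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_len:
        segments.append((start, len(scores)))
    return segments


def overlaps(a, b):
    """True if two half-open intervals share at least one residue."""
    return a[0] < b[1] and b[0] < a[1]


def fraction_with_repeat(segments, repeats):
    """Fraction of segments that overlap at least one repeat interval."""
    if not segments:
        return 0.0
    hits = sum(any(overlaps(seg, rep) for rep in repeats) for seg in segments)
    return hits / len(segments)


# Toy profile: one short and one long predicted disordered run (invented values).
scores = [0.2] * 5 + [0.8] * 12 + [0.1] * 5 + [0.9] * 60 + [0.3] * 3
repeats = [(30, 36)]  # a hypothetical tandem repeat inside the long run

segs10 = disordered_segments(scores, min_len=10)
segs50 = disordered_segments(scores, min_len=50)
print(segs10, fraction_with_repeat(segs10, repeats))
print(segs50, fraction_with_repeat(segs50, repeats))
```

Raising the cut-off discards short disordered runs, which rarely contain a repeat, so the proportion of IURs containing an AAR rises, in the same direction as the change from 4.6% to 13% reported above.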
Comparison of predictions on locations of tandem repeats by three IUR predictors
Although tandem repeats of amino acids are easily recognized features of proteins and have been extensively studied, protein sequences show more widespread repetitive features. This is shown by the high proportion of proteins containing repetitive segments - approximately 50% as measured by SEG  and over 70% of the S. cerevisiae proteome as measured by SIMPLE . In this study we have compared the frequencies of tandem repeats with those of C4 repeats (repetitive regions with a local overrepresentation of motifs of length four residues) using SIMPLE, which has the advantage that it identifies explicitly the overrepresented motif in a given region. We have carried out this comparison in a large set of proteins orthologous between four mammals and chicken, which is the most closely related non-mammalian species with a sequenced genome. This allows us to compare repeat frequencies both between types and between species.
After excluding C4 motifs that overlap tandem repeats, many of the C4 motifs detected in these genomes are clearly related to common tandemly repeated amino acids (six of the seven most common tandem amino acid types in Figure 1a are mirrored by the six most common homogeneous C4 repeat types in Figure 1b), suggesting that the underlying mechanisms that give rise to them are similar. This is also reflected in the high correlations seen between the frequencies of tandem repeats and their respective homogeneous C4 repeats. Tandem AARs most likely evolve by replication slippage, as they evolve more rapidly if they are encoded by pure codon repeats than interrupted codon repeats [6, 13]. Dieringer and Schlötterer  introduced a novel, slippage-related process they called indel slippage that acts in a non-repeat-length-dependent manner on repeated motifs as short as a single nucleotide. Such a mechanism could contribute to the evolution of C4 repeats and other cryptically repetitive sequences [47, 48] and could give rise to differences in the frequencies of tandem and cryptic repeats.
The biggest difference in frequency between tandem and cryptic repeats was seen for Leu, which is rare among C4 repeats. In addition, Q4 repeats are by far the most common class of C4 repeats while Gln is only the seventh most numerous class of tandem repeat in our sample. These large differences could reflect differences in underlying mechanisms (although this seems superficially unlikely as Q tandem repeats are known to undergo rapid evolution [6, 49]) but could also reflect differential selective forces (acting strongly against L4 repeats and Q tandem repeats but less so against their counterparts).
Repeat frequencies were highly similar between the mammals, but the chicken proteome showed a distinct frequency distribution in which most repeat frequencies were lower. A partial exception to this pattern was tandem S repeats, which, although rarer in chicken than in mammals, were the most common class in the chicken proteome. A trivial explanation for these differences could be the currently lower quality of the chicken genome sequence. However, this is unlikely to be the main explanation as the dataset we used contained only clearly identifiable orthologues. Another, and more interesting, possibility is that the lower frequency in chicken is the result of the general reduction of genome size in birds. The chicken genome is approximately one-third the size of the human genome  while bird genomes in general are approximately half the size of mammalian genomes . Analysis of the evolution of bird genome size indicates that genome shrinkage took place in the saurischian lineage leading to the birds circa 200 to 300 million years ago and that this was accompanied by a reduction in the genome fraction of repetitive elements . A global correlation of genome sequence repetition with genome size has also been described [52, 53]. The lower frequency of amino acid repeats in chicken proteins may therefore reflect a parallel process of loss of transposable elements and tandem and cryptic repeats in that evolutionary lineage. A possible explanation for the stronger conservation of S repeats between mammals and chicken than other repeat types is that they play a less dispensable role in protein function; serine-rich domains (RS domains) are intimately involved in alternative splicing  and it is possible that this role is sufficiently important to ensure their retention.
Previous analyses of the evolution of Gln repeats have suggested that in the early stages of their emergence, when encoded by pure codon repeats, they appear preferentially in regions of proteins that are subject to relatively low levels of purifying selection (that is, regions that evolve more quickly than the rest of the protein) [7, 21, 22]. In this study we have analyzed the evolution of regions flanking tandem and C4 AARs in human-rodent and human-chicken comparisons and show the same trend, confirming that the majority of tandem and C4 repeats in proteins emerge in rapidly evolving subregions. We also confirm earlier suggestions [21, 22] that conserved repeats lie in relatively more conserved protein subregions than non-conserved repeats and show that conserved AARs tend to lie in more conserved proteins than non-conserved AARs. In addition, we observe elevated sequence differences around conserved repeats of both types, although this elevation is less extreme than is observed for non-conserved repeats. The latter result differs from a previous study that did not find a difference between flanking regions and the remainder of the proteins for conserved AARs . However, that study only considered a relatively small number of proteins and so most likely failed to detect this difference due to a lack of statistical power. Generally, the results are consistent with a model of repeat evolution whereby repeats tend to emerge in less-conserved regions of proteins and become frozen in length as they reach a length at which they are close to a threshold at which they may cause deleterious phenotypes [16, 21] but they also suggest that the regions in which repeats become fixed may continue to evolve relatively rapidly after repeat fixation.
IURs are regions of proteins that do not form stable tertiary structures under native conditions. Analyses of the extent of disorder in whole genomes suggest that in eukaryotes more than 40% of proteins are either completely disordered or contain significant regions of disorder [40, 41]. In this dataset we find 34% of proteins to contain IURs of length > 50 and 85% to contain an IUR of length > 10. These regions are thought to form flexible regions of proteins that might have a number of functions, including binding to other proteins and small molecules and providing flexibility in multidomain proteins. In an analysis of repeat content of a relatively small number of intrinsically unstructured protein regions, Tompa  identified an apparently strong role for AARs in IUR evolution. The definition of 'repeats' in his analysis is different from ours as it included longer, complex repeated motifs as well as simple sequence repeats, but some simple sequence repeats did appear in his results. This raises the question whether there is a real association of simple AARs with IURs, and whether an association of this type can account for the evolutionary dynamics of AARs. Here we have investigated this by considering the overlap between tandem and C4 repeats and, first, domains identifiable by searching the SUPERFAMILY and InterPro databases, and second, unstructured regions predicted by the RONN predictor. The majority of AARs, with the exception of L tandem repeats, lie within IURs predicted by RONN (Figure 6).
We obtained inconsistent predictions on the level of structure shown by A repeats. They were predicted to be predominantly unstructured by two methods, RONN and DISOPRED, but not by a third, IUPRED. This disagreement may reflect the different methodologies employed by the different algorithms as IUPRED takes account of the chemical characteristics of the sequence being analyzed whereas RONN and DISOPRED use structural analyses of proteins. The ambiguous position of A in these analyses is interesting in the light of its role as the second major cause of human repeat expansion disease, after Q. Gln repeats are notable in showing markedly higher proportions of disorder as tandem repeats than as C4 repeats, suggesting that expansion of Q repeats could have a destabilizing effect on proteins, as suggested previously .
Seven of the eight most common tandemly repeated amino acids in our dataset correspond to the seven disorder-promoting amino acids defined by Dunker et al. . Lise and Jones  in their study of common amino acid patterns in unstructured regions also identified a number of patterns similar to the most common C4 repeats, notably E- and P-rich regions. A strong element of the purifying selection acting against the emergence of AARs within folded regions of proteins therefore appears to be selection against their propensity to lower the stability of these regions. Interestingly, as noted by Kreil and Kreil , N repeats are much rarer than Q repeats - indeed, in our analysis of human proteins we found only four tandem N repeats. This observation may reflect the propensity of Asn to promote order  and consequent purifying selection acting against the appearance of N repeats in unstructured regions. A similar argument may apply to D and E repeats - Glu, which is common in AARs, is disorder-promoting whereas Asp, which is rare in AARs, is not. In this context, it is noteworthy that although E repeats are the most common class in mammals and the most often predicted to be unstructured, they are also, after L repeats, the class most commonly found associated with SUPERFAMILY and InterPro domains. This raises the question whether the domains in which they are located tend to be close to the threshold of instability. Mean RONN scores of domains containing E repeats are 0.44 for SUPERFAMILY and 0.46 for InterPro domains. These compare to means for all domains containing repeats of 0.43 for SUPERFAMILY domains and 0.41 for InterPro domains. The mean for E repeats in SUPERFAMILY domains is typical of all repeat-containing domains, but that for InterPro domains is the highest amongst all repeat types. As most of the domains containing E repeats are InterPro and not SUPERFAMILY domains, this raises the possibility that some E repeat-containing InterPro domains are relatively unstable.
L tandem repeats form interesting exceptions to the general association of AARs with unstructured regions as they are predicted to be 100% structured. The amino acids found in tandem repeats tend to be hydrophilic; all the most hydrophilic amino acids  are found in the class of common tandem AARs - the only strongly hydrophobic amino acid in this class is Leu. Hydrophobic amino acids tend to occupy buried positions within proteins, so it is not surprising that Leu repeats show a high propensity to be structured. In earlier analyses, Leu repeats have been found to be concentrated close to the amino termini of proteins [15, 59], presumably forming part of the hydrophobic region of signal sequences, although Leu may also contribute to transmembrane segments of proteins and more generally to protein cores and stabilizing secondary and tertiary structure .
The majority of AARs have arisen during evolution within protein regions with the characteristics of IURs. This is true both of tandem and cryptic repeats, which have many common characteristics such as relative frequency and, to a lesser extent, GO associations. The dynamics of the evolution of most AARs are, therefore, likely to mirror those of IURs. Some, but not all, IURs evolve more rapidly than the proteins they are part of [27, 28]. Despite this, our results suggest that only a small subset (no more than 15%) of IURs contain AARs. This raises the question whether there are specific subclasses of rapidly evolving IURs that have a higher propensity to evolve AARs. As AARs tend to be associated with transcription and cell signaling, it is possible that proteins with these types of functions have particular types of IUR that might predispose them to evolve repeats.
IURs are thought to play an important role in protein-protein interactions. Repeat accumulation may, therefore, play a role in the evolution of protein-protein interactions in transcriptional and signaling networks by expanding the repertoire of disordered regions. Because they evolve rapidly, repeat sequences potentially provide a means for organisms to rapidly tune their transcriptional and signaling protein-protein interaction networks .
Leu (and Ala) repeats form a special class in being hydrophobic amino acids that commonly form repeat structures. Leu repeats are consistently predicted to be structured, and Ala repeats often are. Glu repeats, which are very common, are also often found within structured regions, although Glu is disorder-promoting. Further studies of the evolution of these repeat classes are therefore merited as repeat variation in structured regions may be expected to have significant effects on protein structure and/or stability.
Materials and methods
For the analyses presented in this paper we prepared a set of orthologous proteins present in all five species, extracted from the Ensembl database version 41 . We downloaded mouse, rat, chimpanzee and chicken proteins that are orthologous to human proteins. All proteins were chosen to be orthologous to the same human protein. We excluded any duplicate entries, any sequences that were under 300 amino acids (thereby removing proteins too short to allow meaningful analysis of sequences flanking repeats) and any human and mouse protein that did not have a Swiss-Prot  identifier. The final dataset consisted of 5,815 orthologous proteins.
Identification of amino acid repeats
Perfect tandem AARs were identified using a standalone Java program. Tandem repeats are defined here as continuous runs of a single amino acid with a length of more than four residues (that is, five or more).
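The tandem repeat definition is simple enough to state as a few lines of code. The following is a minimal Python sketch of that definition, not the standalone program used in the study; the function name and the example sequence are invented for illustration.

```python
import re

# One residue followed by at least four more copies of itself: a run of five or more.
TANDEM = re.compile(r"([A-Z])\1{4,}")


def find_tandem_repeats(seq):
    """Return (residue, start, end) for each homopolymer run of length >= 5.

    Coordinates are 0-based and half-open, so seq[start:end] is the repeat.
    """
    return [(m.group(1), m.start(), m.end()) for m in TANDEM.finditer(seq)]


example = "MKQQQQQQLSSSSAEEEEEK"  # hypothetical sequence
for residue, start, end in find_tandem_repeats(example):
    print(residue, start, end, example[start:end])
```

In this example the run of four serines falls below the threshold and is not reported, whereas the glutamine and glutamate runs are.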
Cryptic repeats were identified using version 3 of the program SIMPLE  with modifications to increase its speed (S Greenaway, MS and JMH, unpublished ). To distinguish C4 repeats from overlapping tandem repeats, we excluded all C4 repeats that overlapped tandem repeats from further analysis. For any given repeat unit size, SIMPLE identifies sequence windows that achieve simplicity scores above any value seen in 100 randomized versions of the test sequence. The repeat unit corresponding to this window is recorded as a significantly simple motif (SSM). We considered repeats with repeat motifs of length four, which we call C4 repeats. For homogeneous motifs such as QQQQ (Q4), these correspond to regions that fall just below our definition of a tandem AAR. By looking at C4, rather than tandem repeats of a shorter length, we were also able to look at interrupted, tandem-like cryptic structures. It should be noted that using longer motif lengths would essentially replicate searches for tandem repeats of different lengths. Using shorter motif lengths (one to three) produces results more similar to those for tandem repeats than those seen for C4 repeats (data not shown).
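The randomization criterion can be sketched independently of SIMPLE's own scoring scheme. The code below is a deliberately simplified stand-in: it scores a sequence by the largest number of occurrences of a given motif in any 64-residue window, and calls the motif significant only if that score exceeds the score of every one of 100 shuffled versions of the same sequence. The scoring function, the function names and the toy sequences are assumptions for illustration and do not reproduce SIMPLE's simplicity score.

```python
import random


def max_window_count(seq, motif, window=64):
    """Largest number of (possibly overlapping) motif occurrences in any window."""
    k = len(motif)
    starts = [i for i in range(len(seq) - k + 1) if seq[i:i + k] == motif]
    best = 0
    for w in range(max(1, len(seq) - window + 1)):
        count = sum(1 for s in starts if w <= s and s + k <= w + window)
        best = max(best, count)
    return best


def is_significant(seq, motif, n_shuffles=100, window=64, seed=0):
    """True if the observed score exceeds the score of every shuffled sequence."""
    rng = random.Random(seed)
    observed = max_window_count(seq, motif, window)
    residues = list(seq)
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # preserves amino acid composition
        if max_window_count("".join(residues), motif, window) >= observed:
            return False
    return True


# Toy sequences: a glutamine-free background, with and without an interrupted Q cluster.
unit = "ACDEFGHIKLMNPRSTVWY"
background = unit * 26
clustered = unit * 13 + "QQQQAQQQQAQQQQ" + unit * 13

print(is_significant(clustered, "QQQQ"))
print(is_significant(background, "QQQQ"))
```

Shuffling preserves composition, so a motif passes the test only when its residues are locally clustered rather than merely abundant, which is the property that distinguishes a cryptic repeat from a compositional bias spread along the protein.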
Evolutionary rate analysis
To test whether the flanking regions of AARs have evolved more rapidly than the whole protein, we constructed multiple alignments of orthologues from the five species using the default settings of CLUSTALW .
Replication slippage has been implicated as a mutational mechanism giving rise to variation in cryptically repetitive sequences  as well as being the major mutational mechanism at microsatellites . As described previously for analyses of rates of evolution of sequences flanking microsatellites , care needs to be taken when analyzing the evolutionary rates of sequences flanking slippage-derived repetitive sequences. This is because sequences immediately flanking the repetitive sequence may also have been derived by slippage and subsequently modified by point mutation, and comparisons of these regions may, therefore, violate the requirement that aligned sites be homologous. By analogy with microsatellites at the DNA level, we therefore defined a transitional zone  for these analyses. This comprised all contiguous amino acid residues one mutational step away from the repeated motif at the codon level. For tandem repeats, the transitional zone started immediately amino- or carboxy-terminal to the limit of the repeat. For C4 repeats we took the region defined by the length of the window used to detect a significant motif (64 amino acids - that is, 30 amino acids either side of the central motif) to define the limits of the repeat, as this is the region containing a significant overrepresentation of the motif in question .
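One way to make the transitional zone for a tandem repeat concrete is sketched below. It assumes the standard genetic code and treats a residue as "one mutational step away" if any of its codons differs by a single nucleotide from any codon of the repeated amino acid; the study may instead have used the codons actually present in each gene, so this is an approximation at the protein level. Function names and the example sequence are invented for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (first, second, third base); '*' marks stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def one_step_neighbours(aa):
    """Amino acids reachable from any codon of aa by one nucleotide substitution."""
    reachable = set()
    for codon, residue in CODON_TABLE.items():
        if residue != aa:
            continue
        for pos in range(3):
            for base in BASES:
                if base != codon[pos]:
                    mutant = codon[:pos] + base + codon[pos + 1:]
                    reachable.add(CODON_TABLE[mutant])
    reachable.discard("*")
    reachable.discard(aa)
    return reachable


def transitional_zone(seq, start, end):
    """Extend the homopolymer repeat seq[start:end] over contiguous flanking
    residues that are one step from the repeated amino acid.

    Returns half-open (zone_start, zone_end) including the repeat itself.
    """
    neighbours = one_step_neighbours(seq[start])
    left = start
    while left > 0 and seq[left - 1] in neighbours:
        left -= 1
    right = end
    while right < len(seq) and seq[right] in neighbours:
        right += 1
    return left, right


print(sorted(one_step_neighbours("Q")))
print(transitional_zone("MAHPQQQQQQKLST", 4, 10))
```

For a glutamine repeat the one-step residues under this assumption are E, H, K, L, P and R, so in the example the zone extends over "HP" on the amino-terminal side and "KL" on the carboxy-terminal side before divergence is measured.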
We then used Protdist from the PHYLIP package  to estimate the sequence divergence of a region 33 amino acids either side of the repeat plus transitional zone (the flanking region) and for the remainder of the protein less the flanking regions, transitional zone and repeat region . Distance estimates calculated by Protdist were based upon the Jones-Taylor-Thornton model . For regression analysis of flanking sequence of divergence against protein remainder divergence, outliers (regions whose residual divergence exceeded 2.326 standard deviations after accounting for the regression between flank and remainder) were removed from calculations.
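The outlier rule used in the regression can be sketched as a two-pass least-squares fit. The code below assumes that "exceeded 2.326 standard deviations" refers to the absolute value of the residual relative to the sample standard deviation of all residuals from the first fit; whether the original criterion was one-sided is not stated, so this is an interpretation. All names and the toy data are illustrative.

```python
from statistics import mean, stdev


def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def regress_without_outliers(remainder, flank, z=2.326):
    """Fit flank divergence on remainder divergence, drop points whose residual
    exceeds z standard deviations, and refit on the retained points.

    Returns ((slope, intercept), number_of_points_removed).
    """
    slope, intercept = fit_line(remainder, flank)
    residuals = [f - (slope * r + intercept) for r, f in zip(remainder, flank)]
    sd = stdev(residuals)
    kept = [(r, f) for r, f, e in zip(remainder, flank, residuals)
            if abs(e) <= z * sd]
    xs, ys = zip(*kept)
    return fit_line(xs, ys), len(remainder) - len(kept)


# Toy data: flanks diverge twice as fast as the remainder, plus one aberrant pair.
remainder = [0.05 * i for i in range(1, 21)]
flank = [2 * r for r in remainder]
flank[9] += 5.0  # invented outlier

(slope, intercept), removed = regress_without_outliers(remainder, flank)
print(round(slope, 3), round(intercept, 3), removed)
```

A slope above one in such a regression corresponds to flanking regions diverging faster than the remainder of the protein across the range of protein divergences.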
Gene Ontology term analysis
FatiGO+  was used to identify level 3 and 4 GO terms significantly overrepresented in subsets of proteins containing particular repeat types. This analysis was carried out only on human and chicken proteins to minimize effects of multiple testing.
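Adjustment for false discovery rate can be illustrated with the Benjamini-Hochberg procedure, which is one widely used method; the exact procedure applied by FatiGO+ is not described here, so the choice is an assumption. The P values in the example are invented and do not correspond to any GO term in this study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity of the adjustment.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.6]  # invented P values for five terms
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted)
print(significant)
```

In this example the third and fourth terms have raw P values below 0.05 but are no longer significant after adjustment, which shows why restricting the number of species tested limits the loss of power from multiple testing.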
To test whether C4 and tandem repeats are embedded within functional domains or proteins, we searched for domains annotated in the Interpro database using the InterproScan web service [68–70]. The Interpro database characterizes a given protein, domain or functional site by integrating the most commonly used protein annotation databases. Hits to the SUPERFAMILY database were extracted from these results for separate analysis. Results from the PANTHER protein classification system were excluded from this analysis as they refer to protein function rather than domains .
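Once domain hits are available as residue coordinates, one simple way to quantify how far a repeat is embedded in annotated domains is the fraction of its residues covered by at least one hit. The sketch below assumes half-open (start, end) intervals on the protein and merges overlapping hits so that no residue is counted twice; the function name `fraction_in_domains` and the coordinate convention are illustrative assumptions, not the criterion used in this study.

```python
def fraction_in_domains(repeat, domains):
    """Fraction of the repeat interval covered by the union of domain intervals.

    `repeat` is a half-open (start, end) pair; `domains` is a list of such pairs.
    """
    r_start, r_end = repeat
    # Clip every domain hit to the repeat and discard hits that miss it entirely.
    clipped = sorted(
        (max(r_start, d_start), min(r_end, d_end))
        for d_start, d_end in domains
        if min(r_end, d_end) > max(r_start, d_start)
    )
    covered = 0
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # A new disjoint stretch begins; bank the previous one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent hit: extend the current stretch.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / (r_end - r_start)
```

The same calculation could be applied to predicted unstructured regions by passing their coordinates in place of the domain hits.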
Prediction of intrinsically unstructured regions
AAR: amino acid repeat
IUR: intrinsically unstructured region
RONN: Regional Order Neural Network.
We thank Gail Hutchinson for discussions and moral support and Robert Esnouf and Rebecca Hamer for discussions on RONN predictions. We thank the UK Medical Research Council for financial support.
- Green H, Wang N: Codon reiteration and the evolution of proteins. Proc Natl Acad Sci USA. 1994, 91: 4298-4302. 10.1073/pnas.91.10.4298.
- Hancock JM: Evolution of sequence repetition and gene duplications in the TATA-binding protein TBP (TFIID). Nucleic Acids Res. 1993, 21: 2823-2830. 10.1093/nar/21.12.2823.
- Karlin S, Burge C: Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proc Natl Acad Sci USA. 1996, 93: 1560-1565. 10.1073/pnas.93.4.1560.
- Wharton KA, Yedvobnick B, Finnerty VG, Artavanis-Tsakonas S: opa: a novel family of transcribed repeats shared by the Notch locus and other developmentally regulated loci in D. melanogaster. Cell. 1985, 40: 55-62. 10.1016/0092-8674(85)90308-3.
- Huntington's Disease Collaborative Research Group: A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. Cell. 1993, 72: 971-983. 10.1016/0092-8674(93)90585-E.
- Albà MM, Santibáñez-Koref MF, Hancock JM: Conservation of polyglutamine tract size between mice and humans depends on codon interruption. Mol Biol Evol. 1999, 16: 1641-1644.
- Djian P, Hancock JM, Chana HS: Codon repeats in genes associated with human diseases: fewer repeats in the genes of nonhuman primates and nucleotide substitutions concentrated at the sites of reiteration. Proc Natl Acad Sci USA. 1996, 93: 417-421. 10.1073/pnas.93.1.417.
- Lovell SC: Are non-functional, unfolded proteins ('junk proteins') common in the genome?. FEBS Lett. 2003, 554: 237-239. 10.1016/S0014-5793(03)01223-7.
- Kazemi-Esfarjani P, Trifiro MA, Pinsky L: Evidence for a repressive function of the long polyglutamine tract in the human androgen receptor: possible pathogenetic relevance for the (CAG)n expanded neuronopathies. Hum Mol Genet. 1995, 4: 523-527. 10.1093/hmg/4.4.523.
- Lanz RB, Wieland S, Hug M, Rusconi S: A transcriptional repressor obtained by alternative translation of a trinucleotide repeat. Nucleic Acids Res. 1995, 23: 138-145. 10.1093/nar/23.1.138.
- Pinto M, Lobe CG: Products of the grg (Groucho-related gene) family can dimerize through the amino-terminal Q domain. J Biol Chem. 1996, 271: 33026-33031. 10.1074/jbc.271.51.33026.
- Schwechheimer C, Smith C, Bevan MW: The activities of acidic and glutamine-rich transcriptional activation domains in plant cells: design of modular transcription factors for high-level expression. Plant Mol Biol. 1998, 36: 195-204. 10.1023/A:1005990321918.
- Alba MM, Santibáñez-Koref MF, Hancock JM: Amino acid reiterations in yeast are overrepresented in particular classes of proteins and show evidence of a slippage-like mutational process. J Mol Evol. 1999, 49: 789-797. 10.1007/PL00006601.
- Young ET, Sloan JS, Van Riper K: Trinucleotide repeats are clustered in regulatory genes in Saccharomyces cerevisiae. Genetics. 2000, 154: 1053-1068.
- Alba MM, Guigo R: Comparative analysis of amino acid repeats in rodents and humans. Genome Res. 2004, 14: 549-554. 10.1101/gr.1925704.
- Hancock JM, Simon M: Simple sequence repeats in proteins and their potential role in network evolution. Gene. 2005, 345: 113-118. 10.1016/j.gene.2004.11.023.
- Fondon JW, Garner HR: Molecular origins of rapid and continuous morphological evolution. Proc Natl Acad Sci USA. 2004, 101: 18058-18063. 10.1073/pnas.0408118101.
- Albrecht A, Mundlos S: The other trinucleotide repeat: polyalanine expansion disorders. Curr Opin Genet Dev. 2005, 15: 285-293. 10.1016/j.gde.2005.04.003.
- Anan K, Yoshida N, Kataoka Y, Sato M, Ichise H, Nasu M, Ueda S: Morphological change caused by loss of the taxon-specific polyalanine tract in Hoxd-13. Mol Biol Evol. 2007, 24: 281-287. 10.1093/molbev/msl161.
- Mularoni L, Veitia RA, Alba MM: Highly constrained proteins contain an unexpectedly large number of amino acid tandem repeats. Genomics. 2007, 89: 316-325. 10.1016/j.ygeno.2006.11.011.
- Hancock JM, Worthey EA, Santibanez-Koref MF: A role for selection in regulating the evolutionary emergence of disease-causing and other coding CAG repeats in humans and mice. Mol Biol Evol. 2001, 18: 1014-1023.
- Faux NG, Huttley GA, Mahmood K, Webb GI, Garcia de la Banda M, Whisstock JC: RCPdb: An evolutionary classification and codon usage database for repeat-containing proteins. Genome Res. 2007, 17: 1118-1127. 10.1101/gr.6255407.
- Wright PE, Dyson HJ: Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol. 1999, 293: 321-331. 10.1006/jmbi.1999.3110.
- Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradovic Z: Intrinsic disorder and protein function. Biochemistry. 2002, 41: 6573-6582. 10.1021/bi012159+.
- Tompa P: Intrinsically unstructured proteins evolve by repeat expansion. Bioessays. 2003, 25: 847-855. 10.1002/bies.10324.
- Romero P, Obradovic Z, Li X, Garner EC, Brown CJ, Dunker AK: Sequence complexity of disordered protein. Proteins. 2001, 42: 38-48. 10.1002/1097-0134(20010101)42:1<38::AID-PROT50>3.0.CO;2-3.
- Brown CJ, Takayama S, Campen AM, Vise P, Marshall TW, Oldfield CJ, Williams CJ, Dunker AK: Evolutionary rate heterogeneity in proteins with long disordered regions. J Mol Evol. 2002, 55: 104-110. 10.1007/s00239-001-2309-6.
- Chen JW, Romero P, Uversky VN, Dunker AK: Conservation of intrinsic disorder in protein domains and families: II. functions of conserved disorder. J Proteome Res. 2006, 5: 888-898. 10.1021/pr060049p.
- Dosztanyi Z, Chen J, Dunker AK, Simon I, Tompa P: Disorder and sequence repeats in Hub proteins and their implications for network evolution. J Proteome Res. 2006, 5: 2985-2995. 10.1021/pr060171o.
- Wootton JC: Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem. 1994, 18: 269-285. 10.1016/0097-8485(94)85023-2.
- Pellegrini M, Marcotte EM, Yeates TO: A fast algorithm for genome-wide analysis of proteins with repeated sequences. Proteins. 1999, 35: 440-446. 10.1002/(SICI)1097-0134(19990601)35:4<440::AID-PROT7>3.0.CO;2-Y.
- Alba MM, Laskowski RA, Hancock JM: Detecting cryptically simple protein sequences using the SIMPLE algorithm. Bioinformatics. 2002, 18: 672-678. 10.1093/bioinformatics/18.5.672.
- Huntley MA, Clark AG: Evolutionary analysis of amino acid repeats across the genomes of 12 Drosophila species. Mol Biol Evol. 2007, 24: 2598-2609. 10.1093/molbev/msm129.
- Faux NG, Bottomley SP, Lesk AM, Irving JA, Morrison JR, de la Banda MG, Whisstock JC: Functional insights from the distribution and role of homopeptide repeat-containing proteins. Genome Res. 2005, 15: 537-551. 10.1101/gr.3096505.
- Richard GF, Dujon B: Trinucleotide repeats in yeast. Res Microbiol. 1997, 148: 731-744. 10.1016/S0923-2508(97)82449-7.
- Romov PA, Li F, Lipke PN, Epstein SL, Qiu WG: Comparative genomics reveals long, evolutionarily conserved, low-complexity islands in yeast proteins. J Mol Evol. 2006, 63: 415-425. 10.1007/s00239-005-0291-0.
- Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol. 2001, 313: 903-919. 10.1006/jmbi.2001.5080.
- Emanuelsson O, Brunak S, von Heijne G, Nielsen H: Locating proteins in the cell using TargetP, SignalP, and related tools. Nat Protoc. 2007, 2: 953-971. 10.1038/nprot.2007.131.
- Yang ZR, Thomson R, McNeil P, Esnouf RM: RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Bioinformatics. 2005, 21: 3369-3376. 10.1093/bioinformatics/bti534.
- Dunker AK, Obradovic Z, Romero P, Garner EC, Brown CJ: Intrinsic protein disorder in complete genomes. Genome Inform Ser Workshop Genome Inform. 2000, 11: 161-171.
- Oldfield CJ, Cheng Y, Cortese MS, Brown CJ, Uversky VN, Dunker AK: Comparing and combining predictors of mostly disordered proteins. Biochemistry. 2005, 44: 1989-2000. 10.1021/bi047993o.
- Bordoli L, Kiefer F, Schwede T: Assessment of disorder predictions in CASP7. Proteins. 2007, 69 (Suppl 8): 129-136. 10.1002/prot.21671.
- Jones DT, Ward JJ: Prediction of disordered regions in proteins from position specific score matrices. Proteins. 2003, 53 (Suppl 6): 573-578. 10.1002/prot.10528.
- Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT: The DISOPRED server for the prediction of protein disorder. Bioinformatics. 2004, 20: 2138-2139. 10.1093/bioinformatics/bth195.
- Dosztanyi Z, Csizmok V, Tompa P, Simon I: IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics. 2005, 21: 3433-3434. 10.1093/bioinformatics/bti541.
- Dieringer D, Schlotterer C: Two distinct modes of microsatellite mutation processes: evidence from the complete genomic sequences of nine species. Genome Res. 2003, 13: 2242-2251. 10.1101/gr.1416703.
- Tautz D, Trick M, Dover GA: Cryptic simplicity in DNA is a major source of genetic variation. Nature. 1986, 322: 652-656. 10.1038/322652a0.
- Hancock JM, Vogler AP: How slippage-derived sequences are incorporated into rRNA variable region secondary structure: implications for phylogeny reconstruction. Mol Phylogenet Evol. 2000, 14: 366-374.
- Alba MM, Santibanez-Koref MF, Hancock JM: The comparative genomics of glutamine codon repetition: a category of genes that includes repeat expansion disease genes is prominent in humans and mice and rare in Drosophila. J Mol Evol. 2001, 52: 249-259.
- International Chicken Genome Sequencing Consortium: Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004, 432: 695-716. 10.1038/nature03154.
- Organ CL, Shedlock AM, Meade A, Pagel M, Edwards SV: Origin of avian genome size and structure in non-avian dinosaurs. Nature. 2007, 446: 180-184. 10.1038/nature05621.
- Hancock JM: The contribution of slippage-like processes to genome evolution. J Mol Evol. 1995, 41: 1038-1047. 10.1007/BF00173185.
- Hancock JM: Genome size and the accumulation of simple sequence repeats: Implications of new data from genome sequencing projects. Genetica. 2002, 115: 93-103. 10.1023/A:1016028332006.
- Long JC, Caceres JF: The SR protein family of splicing factors: master regulators of gene expression. Biochem J. 2009, 417: 15-27. 10.1042/BJ20081501.
- Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, Oldfield CJ, Campen AM, Ratliff CM, Hipps KW, Ausio J, Nissen MS, Reeves R, Kang C, Kissinger CR, Bailey RW, Griswold MD, Chiu W, Garner EC, Obradovic Z: Intrinsically disordered protein. J Mol Graph Model. 2001, 19: 26-59. 10.1016/S1093-3263(00)00138-8.
- Lise S, Jones DT: Sequence patterns associated with disordered regions in proteins. Proteins. 2005, 58: 144-150. 10.1002/prot.20279.
- Kreil DP, Kreil G: Asparagine repeats are rare in mammalian proteins. Trends Biochem Sci. 2000, 25: 270-271. 10.1016/S0968-0004(00)01594-2.
- Attwood T: Hydropathy (hydrophobicity). Dictionary of Bioinformatics and Computational Biology. Edited by: Hancock JM, Zvelebil MJ. 2004, Hoboken, New Jersey: John Wiley & Sons, Inc, 247-
- Karlin S, Brocchieri L, Bergman A, Mrazek J, Gentles AJ: Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci USA. 2002, 99: 333-338. 10.1073/pnas.012608599.
- Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Herrero J, Holland R, Howe K, Howe K, Johnson N, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Melsopp C, Megy K, et al: Ensembl 2007. Nucleic Acids Res. 2007, 35: D610-D617. 10.1093/nar/gkl996.
- Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31: 365-370. 10.1093/nar/gkg095.
- MRC Harwell|SIMPLE. [http://www.har.mrc.ac.uk/research/bioinformatics/software/simple.html]
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673.
- Hancock JM: Microsatellites and other simple sequences: genomic context and mutational mechanisms. Microsatellites: Evolution and Applications. Edited by: Goldstein DB, Schlötterer C. 1999, Oxford: Oxford University Press, 1-9.
- PHYLIP (Phylogeny Inference Package) version 3.6. [http://evolution.genetics.washington.edu/phylip.html]
- Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992, 8: 275-282.
- Al-Shahrour F, Minguez P, Tárraga J, Medina I, Alloza E, Montaner D, Dopazo J: FatiGO +: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res. 2007, 35: W91-W96. 10.1093/nar/gkm260.
- services:interproscan|EBI Web Services|EBI. [http://www.ebi.ac.uk/Tools/webservices/services/interproscan]
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R, Courcelle E, Das U, Daugherty L, Dibley M, Finn R, Fleischmann W, Gough J, Haft D, Hulo N, Hunter S, Kahn D, Kanapin A, Kejariwal A, Labarga A, Langendijk-Genevaux PS, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, et al: New developments in the InterPro database. Nucleic Acids Res. 2007, 35: D224-D228. 10.1093/nar/gkl841.
- Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R: InterProScan: protein domains identifier. Nucleic Acids Res. 2005, 33: W116-W120. 10.1093/nar/gki442.
- Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A: PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 2003, 13: 2129-2141. 10.1101/gr.772403.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.