Insights into the regulation of intrinsically disordered proteins in the human proteome by analyzing sequence and gene expression data

Edwards, Yvonne JK; Lobley, Anna E; Pentony, Melissa M; Jones, David T

doi:10.1186/gb-2009-10-5-r50

Research
Open access
Published: 11 May 2009

Genome Biology volume 10, Article number: R50 (2009)

Abstract

Background

Disordered proteins need to be expressed to carry out specified functions; however, their accumulation in the cell can potentially cause major problems through protein misfolding and aggregation. Gene expression levels, mRNA decay rates, microRNA (miRNA) targeting and ubiquitination have critical roles in the degradation and disposal of human proteins and transcripts. Here, we describe a study examining these features to gain insights into the regulation of disordered proteins.

Results

In comparison with ordered proteins, disordered proteins have a greater proportion of predicted ubiquitination sites. The transcripts encoding disordered proteins also have higher proportions of predicted miRNA target sites and higher mRNA decay rates, both of which are indicative of the observed lower gene expression levels. The results suggest that the disordered proteins and their transcripts are present in the cell at low levels and/or for a short time before being targeted for disposal. Surprisingly, we find that for a significant proportion of highly disordered proteins, all four of these trends are reversed. Predicted estimates for miRNA targets, ubiquitination and mRNA decay rate are low in the highly disordered proteins that are constitutively and/or highly expressed.

Conclusions

Mechanisms are in place to protect the cell from these potentially dangerous proteins. The evidence suggests that the enrichment of signals for miRNA targeting and ubiquitination may help prevent the accumulation of disordered proteins in the cell. Our data also provide evidence for a mechanism by which a significant proportion of highly disordered proteins (with high expression levels) can escape rapid degradation to allow them to successfully carry out their function.

Background

Natively unfolded or disordered proteins are proteins that do not form a stable three-dimensional structure in their native state. A disordered protein can be either completely unfolded or comprise both folded and unfolded segments [1–4]. Previous analyses have shown that the presence of large regions of disorder within proteins correlates strongly with function [1–20]. These functions typically relate to gene regulation and signaling classes that are of particular importance to higher organisms [6, 21]. Previous work has also shown that over 30% of proteins in eukaryotic genomes are likely to be disordered, a percentage that is much higher than found within prokaryotic genomes [6, 12, 22, 23]. Whilst there are functional benefits that accrue from disordered proteins, the use of disorder carries with it significant risks [24]. The prevalence of human diseases that correspond to highly disordered proteins is striking [24–31]; these include diabetes, neurodegenerative disorders [25–28], cardiovascular disease [29] and cancer [30]. In fact, many neurodegenerative disorders arise from the aggregation of disordered proteins [25–28]. If disordered proteins are indeed potential hazards to the healthy maintenance of human cells, then both their production and disposal should be very carefully regulated. Such is the danger of protein aggregation in living cells that a number of efficient degradation mechanisms are in place to quickly dispose of misfolded proteins [32]. The problem for disordered proteins may well be to survive long enough to carry out their function in such a hostile environment.

The equilibrium level of a protein depends on its rate of production relative to its rate of degradation. The quantity of a protein produced in the cell is affected by the expression level of its mRNA transcript. The levels of gene expression are controlled in the cell in a number of different ways, for example, by varying the rates of transcription and translation and altering the rate at which mRNA is degraded. In combination with transcription, mRNA degradation plays a critical role in regulating gene expression [33, 34]. If proteins need to remain in the disordered state for any length of time, they need to either bypass the endogenous degradation pathways (such as the ATP-dependent proteolytic 26S proteasome [32]) that specifically target unfolded proteins or be produced in sufficient quantity to temporarily overload the protein degradation pathways. The second option is, of course, extremely risky as high production levels of disordered proteins may result in aggregation. This suggests that the first option is the most likely, but in this case, how can disordered proteins escape rapid degradation to allow them to successfully carry out their function?

Recent work suggested that disordered residues make a protein more susceptible to intracellular degradation [35]. The in vivo half-lives of yeast proteins were shown to correlate with disorder as opposed to the actual degradation signals and motifs. In our study we analyze biological properties known to regulate and affect the degradation rates of proteins and transcripts to investigate how these correlate with protein disorder. Gene expression is a continuous process spanning transcription factor activation, nuclear localization of transcription factors, chromatin decompaction, coupled initiation and 5' capping of transcripts, coupled transcription and mRNA processing, splicing, cleavage and 3' polyadenylation, mRNA packaging, mRNA export into the cytoplasm, translation and protein folding [36]. Biological processes that lower protein and mRNA copy numbers include proteolytic degradation by proteases, microRNA (miRNA):mRNA targeting and destruction of mRNA by nucleases. Here, we characterize absolute mRNA levels, mRNA decay rates, protein stability, predicted miRNA targeting and ubiquitination to assess whether disordered proteins (and their encoding transcripts) display any unusual characteristics.

miRNAs are a class of small non-coding RNA molecules (comprising about 22 nucleotides) that regulate gene expression and mediate diverse cellular processes such as development, differentiation, proliferation and apoptosis [37–41]. miRNAs target the 3' untranslated regions of mRNA molecules, which typically results in the down-regulation of gene expression by translational repression and/or a reduction of mRNA transcript levels [42]. Several algorithms are available to predict the mRNA targets [43–51].

Ubiquitination is a reversible post-translational modification of cellular proteins where ubiquitin (a 76 residue protein) is covalently attached to the ε amino group of lysines of target proteins. Diverse forms of ubiquitin modifications exist and influence the functional outcome of target proteins in distinct ways [52, 53]. Mono-ubiquitination or multi-ubiquitination are implicated in various nonproteolytic cellular functions, including endocytosis, endosomal sorting and DNA repair [52]. Polyubiquitination is mainly associated with proteasomal degradation [54, 55]. Whilst ubiquitination can determine the fate of a given protein for proteolytic degradation by the 26S proteasome, ubiquitination of transcription factors with a VP-16 activation domain is also shown to be required for transcriptional activation [56–58]. Like miRNA targeting [59–69], ubiquitination is crucial in regulating a variety of cellular processes in eukaryotes [59–61] and has significant implications in the etiology of a number of serious diseases such as cancer [62–64], neurodegeneration [65, 66] and cardiovascular dysfunction [67–69].

To gain new insights into the regulation of disordered proteins, we carried out a series of studies to examine how a number of features known to affect protein and transcript degradation correlate with protein disorder. We investigated whether the mRNA transcripts encoding disordered proteins decay more rapidly. To establish mRNA expression patterns for transcripts encoding disordered proteins and to reveal novel insights into the molecular mechanisms of transcriptional regulation [70–74], mRNA expression levels were characterized in normal tissues and cell lines using public domain microarray expression datasets. Transcripts co-expressed with the transcripts encoding disordered proteins were identified to suggest the key biological pathways that are affected or under regulatory control of disordered proteins and their transcripts. We investigated whether disordered proteins have lower expression levels and/or the transcripts encoding them are more likely to be targeted by miRNA. One of the aims of this analysis was to use miRNA prediction to establish the trends that exist between possible miRNA targeting and the transcripts encoding disordered proteins. We examined if disordered proteins contain sites that are more susceptible to degradation using a novel ubiquitination site prediction tool. Protein turnover rates for disordered sequences were also investigated by considering stability determined from an in vivo study measuring protein turnover [75].

In this study, we examine the available human gene expression data and properties of the human proteome and transcriptome to investigate whether disordered proteins have any unusual characteristics in terms of their production and disposal in human cells. Specifically, we were interested in gaining insights into the means by which disordered proteins avoid early degradation without resorting to the severe risks of over-expression.

Results

Five properties of the human proteins and transcripts were investigated in relation to disorder in the proteome. First, three expression profile studies on transcripts encoding disordered proteins were carried out: the general features of their expression levels were characterized; their expression profiles across the samples were clustered by abundance and functionally annotated to provide a classification of the biological roles of their encoded proteins; and transcripts co-expressed with them were identified. Second, we searched for correlation between the extent of mRNA decay rates and varying amounts of protein disorder encoded by transcripts. Third, the occurrence of disorder was compared with protein stability indices determined by a global stability profiling assay. Fourth, miRNA prediction tools were used to establish trends that exist between transcripts encoding disordered proteins and miRNA targeting. Finally, correlations between ubiquitination sites and protein disorder levels were investigated.

Protein disorder and gene expression

Protein disorder and absolute gene expression levels

On average, transcripts that encode highly disordered proteins are expressed in lower copy numbers than those that encode highly ordered proteins (Figure 1a). Figure 1a shows the average absolute gene expression values calculated across 207 normal tissue and cell line samples (Table 1). Whilst the scale for the absolute values is displayed in log₂ units, in the decimal scale the absolute gene expression levels of the genes for transcripts that encode highly disordered proteins are roughly half those of the genes for transcripts that encode highly ordered proteins. A similar trend was obtained for transcripts that encode disordered and ordered proteins (Figure S1a in Additional data file 1).
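
The relationship between the log₂ scale and the linear scale mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the example values are hypothetical and are not read from Figure 1a. A difference of one log₂ unit corresponds to a two-fold difference on the linear scale.

```python
def linear_ratio(log2_a: float, log2_b: float) -> float:
    """Return expression(a) / expression(b) on the linear scale,
    given both expression values in log2 units."""
    return 2.0 ** (log2_a - log2_b)


# Hypothetical values (not from the paper): a transcript that sits one
# log2 unit below another has half its expression on the linear scale.
ratio = linear_ratio(7.0, 8.0)
```

A ratio near 0.5 is what "roughly half" means once the log₂ values are converted back.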

Table 1 Bioinformatics analysis of expression of human genes across 207 samples from 75 different types of normal tissues and cell lines

To investigate whether these low expression levels were correlated with the occurrence of disorder in the protein products, transcripts were grouped according to the frequency of disorder in the encoded protein (Figure 2a). Disorder categories are written in standard interval notation from here on; for example, (60,80]% denotes > 60% and ≤ 80% disordered residues. As the percentage of disordered residues increases up to the (60,80]% category, the average gene expression level steadily decreases. However, for the (80,100]% disorder category the average sample expression levels were greater than expected using a Wilcoxon paired rank test (P < 0.0001). This (80,100]% category comprises <1% of the data (Table 2). To verify that these trends were independent of function, we filtered the data to impose equality of representation of biological process (BP) and molecular function (MF) Gene Ontology (GO) terms. Specifically, a maximum of ten randomly chosen examples were selected for each annotation term at specificity level 4 or below. The results (Figure 2a) indicate that the correlation between transcript expression levels and the amount of disorder is not dictated by function class bias and represents a genuine and robust feature of the data.
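
The five disorder categories used throughout the Results ([0,20]%, (20,40]%, (40,60]%, (60,80]% and (80,100]%) can be assigned as sketched below. The function and constant names are hypothetical; only the interval boundaries come from the text.

```python
from bisect import bisect_left

# Upper (inclusive) edges of the five disorder categories, in percent.
BIN_EDGES = [20, 40, 60, 80, 100]
BIN_LABELS = ["[0,20]", "(20,40]", "(40,60]", "(60,80]", "(80,100]"]


def disorder_bin(percent_disordered: float) -> str:
    """Map a percentage of disordered residues to its interval label.

    Each interval is closed on the right, so exactly 80% falls in
    (60,80] and anything above 80% falls in (80,100].
    """
    if not 0 <= percent_disordered <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    # bisect_left returns the index of the first edge >= the value,
    # which is exactly the right-closed bin the value belongs to.
    return BIN_LABELS[bisect_left(BIN_EDGES, percent_disordered)]
```

Using right-closed intervals keeps the boundary cases unambiguous, which matters when a category such as (80,100]% holds under 1% of the data.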

Table 2 Percentage of transcripts encoding disordered proteins predicted to be targeted by miRNA

Absolute gene expression profiles for highly disordered proteins

To differentiate modes of gene expression behavior among the highly disordered proteins, hierarchical clustering analysis of the absolute expression levels was carried out. The resulting heat map (Figure 3a) shows that the situation is not as simple as suggested in Figure 1. Five broad classes of expression patterns for the genes encoding highly disordered proteins could be defined (Figure 3; Tables S1 and S2 in Additional data file 2). These groups were functionally characterized by performing over-representation tests within each of the five classes. The first set of transcripts (light blue) encode proteins that are almost entirely disordered and contained within the (80,100]% disorder category. In this constitutively expressed group, all transcripts represent large ribosomal subunit proteins that are essential parts of the translation machinery and are expressed in every cell. The second group (dark blue) represents transcripts that exhibit high expression levels in the majority of tissues and display little or no tissue specificity. The third group (green) contains transcripts expressed at medium levels. General DNA binding and transcription factor functions were over-represented in the proteins encoded by the medium expressor group. The fourth group (gold) contains transcripts expressed in a tissue-specific manner. The remaining transcripts form a group that was not detected to be abundant in any of the tissues studied and is referred to as the low or transient expressor group (gray). This low or transient expressor group comprises over 50% of transcripts analyzed (Table 3) and is primarily responsible for the low expression trend reported above. This suggests that over half of the transcripts encoding proteins with large regions of disorder are expressed either at transient or low levels.

Table 3 miRNA targeting of disordered proteins with different gene expression profiles (Figure 4)

Co-regulated transcripts and the highly disordered proteins

A similar functional analysis was carried out for all transcripts detected to be significantly co-regulated with transcripts encoding disordered proteins. Co-regulation was established using significance of the correlation coefficient between transcripts and was calculated for transcript pairs in the (60,80]% and (80,100]% disorder groups. Using empirically derived P-values from the distribution of correlations, a significance threshold at either tail of P < 0.01 was used to describe transcripts as co-regulated. Several of the categories identified as enriched in the co-regulated transcript datasets overlapped and are summarized. In general, the activities of the ubiquitin degradation pathway and the proteolytic catabolic processes were observed to be anti-correlated (down-regulated) with the expression profiles of transcripts encoding highly disordered proteins. Functions enriched in the significantly correlated transcript set included protein complex formation, protein dimerization, protein homo-dimerization, protein hetero-oligomerization and enzyme inhibitors that reduce the activity of proteases (that is, enzymes catalyzing the hydrolysis of peptide bonds) (Table 4).
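
The two-tailed empirical threshold described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, and the background is assumed to be a plain list of correlation coefficients from which tail fractions are read directly.

```python
def empirical_tail_p(observed: float, background: list[float]) -> tuple[float, float]:
    """Return (lower-tail, upper-tail) empirical P-values: the fraction of
    background correlations at or below / at or above the observed value."""
    n = len(background)
    if n == 0:
        raise ValueError("background distribution is empty")
    lower = sum(r <= observed for r in background) / n
    upper = sum(r >= observed for r in background) / n
    return lower, upper


def classify_pair(observed: float, background: list[float], alpha: float = 0.01) -> str:
    """Label a transcript pair using a per-tail significance threshold."""
    lower, upper = empirical_tail_p(observed, background)
    if upper < alpha:
        return "correlated"
    if lower < alpha:
        return "anti-correlated"
    return "not significant"
```

Treating each tail separately is what allows both positively correlated and anti-correlated transcripts, such as those of the ubiquitin degradation pathway, to be reported.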

Table 4 Subsets of GO terms (biological process, molecular function and cellular component) over-represented for co-regulated transcripts encoding highly disordered proteins

Protein disorder, mRNA decay rates and protein stability indices

The mRNA decay rates of the transcripts of 74 highly disordered proteins and 536 highly ordered proteins were compared. The mRNA decay rates for the transcripts encoding highly disordered proteins (0.190871 h^-1) are more than twice that observed for the transcripts encoding highly ordered proteins (0.084944 h^-1) (Figure 1b). A statistically significant difference (P < 0.02) between mRNA decay rates for transcripts encoding highly ordered and highly disordered proteins was found, with the highly disordered datasets having higher mRNA decay rates. The mRNA decay rates for the transcripts encoding 1,980 disordered proteins (0.177596 h^-1) and 1,858 ordered proteins (0.096878 h^-1) were also compared and a similar trend was obtained (Figure S1b in Additional data file 1).
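
To make the decay rates easier to interpret, they can be converted to half-lives. The sketch below assumes simple first-order (exponential) decay, which this section does not state explicitly, so the conversion t½ = ln 2 / k is an assumption; the function and constant names are hypothetical, and only the two rate values come from the text.

```python
import math


def half_life_hours(decay_rate_per_hour: float) -> float:
    """Half-life implied by a first-order decay rate k (in h^-1): ln(2) / k."""
    if decay_rate_per_hour <= 0:
        raise ValueError("decay rate must be positive")
    return math.log(2) / decay_rate_per_hour


# Mean decay rates reported above for the two groups (h^-1).
K_HIGHLY_DISORDERED = 0.190871
K_HIGHLY_ORDERED = 0.084944

# Ratio of the two rates; the text describes it as "more than twice".
rate_ratio = K_HIGHLY_DISORDERED / K_HIGHLY_ORDERED
```

Under this assumption the mean rates correspond to half-lives of about 3.6 h for the highly disordered group and about 8.2 h for the highly ordered group.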

We divided the 33,869 proteins into bins by percentage of disordered residues. When we compared the mRNA decay rates for each of the bins (Figure 2b), there was no significant difference between them. Although this result does not suggest that all disordered proteins show a significant association with higher mRNA decay rates, it does concur with our previous analysis of the (highly) ordered and (highly) disordered protein datasets, in showing a distinct difference between mRNA decay rates for both groups.

The protein stability measures of the highly disordered (179) and highly ordered groups (1,396) were also compared. We found a significant difference (P < 0.0005) between the half-lives of highly ordered and highly disordered proteins, with highly disordered proteins having longer half-lives (Figure 1c).

Consistent with our analysis of decay rates, we divided the 8,666 disordered proteins into bins by percentage of disordered residues. Protein stability indices showed no significant affiliation to a particular binned group, although the (80,100]% disorder bin showed much higher half-lives than the other binned groups (Figure 2c).

Since trends were observed between both mRNA decay rate and disorder, and protein half-life and disorder, the half-lives and decay rates were also compared to see if a relationship existed between mRNA decay rate and protein half-life. The Pearson correlation value between 1,446 overlapping sequences (-0.06) was not significant and suggested that these two characteristics are independent.
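
For reference, the Pearson correlation used above can be computed as in the following sketch. The function name is hypothetical and the code is a generic textbook implementation, not the authors' pipeline; the inputs would be the paired mRNA decay rates and protein stability values for the overlapping sequences.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)
```

A coefficient close to zero, such as the -0.06 reported here, indicates no linear relationship between the two quantities.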

Protein disorder and miRNA targets

Approximately one-quarter of protein coding transcripts are predicted miRNA targets (Table 2). The proportion of transcripts encoding highly disordered proteins that are likely to be miRNA targets is approximately twice that of transcripts encoding highly ordered proteins (Figure 1d; Table 2). The frequency of transcripts with at least one predicted miRNA target site is over-represented in the transcripts encoding highly disordered proteins (P < 0.003) and under-represented in the transcripts encoding highly ordered proteins (P < 0.00001) compared to all transcripts together (Figure S2a in Additional data file 1). A similar trend is observed when comparing the datasets of transcripts encoding ordered and disordered proteins (Table 2); the proportion of the transcripts encoding disordered proteins that are predicted as miRNA targets is approximately twice that of the transcripts encoding ordered proteins (Figure S1c in Additional data file 1; Table 2). miRNA targets are over-represented in the transcripts encoding disordered proteins (P < 0.00001) and under-represented in the transcripts encoding ordered proteins (P < 0.00001) compared to all transcripts together (Figure S2b in Additional data file 1).
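
Over-representation of the kind reported above is commonly assessed with a hypergeometric tail probability. The section does not name the test that was used, so the sketch below is only one plausible choice; the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population: int, successes: int, draws: int, observed: int) -> float:
    """P(X >= observed) when `draws` items are taken without replacement
    from `population` items, of which `successes` carry the property
    (for example, a predicted miRNA target site)."""
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total
```

Here `population` would be all transcripts, `successes` the predicted miRNA targets among them, `draws` the size of the subset (for example, transcripts encoding highly disordered proteins) and `observed` the number of predicted targets in that subset. Under-representation would use the lower tail instead.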

For the transcripts encoding the proteome, the percentage likely to be targeted by miRNA ranges between 13.2% and 37.6% (Figure 2d; Table 2). The percentage of transcripts regulated by miRNA increases (approximately 8%) with increasing percentage of protein disorder for the first three binned categories (Figure 2d; Table 2). The percentage of predicted miRNA targets remains high (35.1%) for the (60,80]% disorder category and is low (13.2%) for the (80,100]% disorder category. Consistently, the likely miRNA targets are under-represented in the [0,20]% and (80,100]% disorder categories at P < 0.00004 (Figure S2c in Additional data file 1) and over-represented in the remaining three classes (P < 5.8 × 10^-7; Figure S2c in Additional data file 1).

Similar trends are obtained using the PicTar (4-Way and 5-Way) software [43, 46] (Figures 1d and 2d; Figure S1c in Additional data file 1). The trends were not observed using mirBase [51] and this could be because this prediction algorithm is reported to have a higher false positive rate than the other two programs (PicTar and TargetScanS) [47, 49, 50]. Redundancy in the datasets makes very little difference to the outcome (Table S3 in Additional data file 2). For example, the proteome and the protein sets filtered for redundancy have very similar percentages of transcripts predicted as targets of miRNA (Table 2; Table S3 in Additional data file 2).

We investigated the patterns of the predicted miRNA targets in the transcripts for disordered proteins in relation to the different expression profiles (Figures 3 and 4 and Table 3). The probes on the microarray chip have a higher representation of predicted miRNA targets (38%) in comparison with the transcriptome encoding the human proteome (26.45%) (Table 2). We compared the protein coding transcripts for the five datasets (Figure 3) using the probes on the microarray chip as a universal protein baseline. The data from the constitutive group had too few data points from which to make inferences (Table 3 and Figures 3 and 4). The tissue-specific expressors (gold) and the high expressors (dark blue) have high expression levels. The main difference between the two classes is that the tissue-specific expressors (gold) have high expression in one or few tissues (Figure 3) and the high expressors (dark blue) have high expression in almost all tissues (Figure 3). These two groups characterized by high levels of gene expression have high percentages of transcripts predicted as miRNA targets (68.09% and 65.85%, respectively; Table 3 and Figure 4). The medium expressors (green) and the low or transient expressors (white) with more moderate levels of gene expression have lower percentages of predicted miRNA targeting (48.39% and 48.06%, respectively). These results suggest that the transcripts of disordered proteins with high levels of expression are more likely to be regulated by miRNA compared to those with moderate and low or transient expression. In addition, the transcripts of highly disordered proteins belonging to the four expression profiles (tissue-specific, high expressors, medium expressors and low or transient expressors) are more likely to be miRNA targets than the transcripts on the microarray chip (Figure 4b). This observation supports the trend observed previously (Table 2) that transcripts encoding disordered proteins are more likely to be targeted by miRNAs compared to protein coding transcripts in general (Figure 4; Figures S1c and S2c in Additional data file 1).

Protein disorder and ubiquitination

To our knowledge, this study presents the first estimate of the percentage of proteins of the human proteome with at least one predicted ubiquitination site and the percentage of residues predicted as ubiquitination sites. We predict that 70.71% of proteins have at least one ubiquitination site and 0.42% of amino acid residues in the proteome are ubiquitination sites.

The percentage of proteins predicted to contain at least one ubiquitination site and the percentage of residues predicted as ubiquitination sites are higher in disordered proteins compared to ordered proteins. Comparing the highly disordered proteins with the highly ordered proteins, we observe increases of 33.81% and 42.50% in the percentage of proteins possessing at least one ubiquitination site and the percentage of residues predicted to be ubiquitination sites, respectively (Figure 1e). The proteins possessing at least one ubiquitination site are slightly over-represented in the highly disordered proteins (P < 0.98; Figure S3a in Additional data file 1) and grossly under-represented in the highly ordered proteins (P < 2.2 × 10^-16; Figure S3a in Additional data file 1). The first trend is not statistically significant. The predicted ubiquitination sites are over-represented in the highly disordered proteins (P < 2.2 × 10^-16; Figure S4a in Additional data file 1) and under-represented for the highly ordered proteins (P < 0.002; Figure S4a in Additional data file 1). Comparing the disordered proteins with the ordered proteins, we observe increases of 33.57% and 12.8% in the percentage of proteins possessing at least one ubiquitination site and the percentage of residues predicted to be ubiquitination sites, respectively (Figure S1d in Additional data file 1). Proteins with one or more predicted ubiquitination sites are over-represented in the disordered datasets (P < 2.2 × 10^-16; Figure S3b in Additional data file 1) and under-represented in the ordered proteins (P < 2.2 × 10^-16; Figure S3b in Additional data file 1). A similar trend is obtained for the percentage of residues predicted as ubiquitination sites.

The relationship between the percentage of proteins with at least one ubiquitination site and the percentage of protein disorder is complex and non-linear, while the percentage of residues predicted as ubiquitination sites and the percentage of protein disorder are positively correlated. The percentage of proteins predicted to have a ubiquitination site increases with the percentage of protein disorder for the first three disorder categories (Figure 2e). The percentage of proteins predicted to have a ubiquitination site remains high at 74.3% for the (60,80]% disorder class and then drops significantly to 55.8% for the (80,100]% disorder category. This is consistent with proteins with one or more predicted ubiquitination sites being over-represented in the (20,40]%, (40,60]% and (60,80]% disorder categories (P < 0.04; Figure S3c in Additional data file 1) and under-represented in the [0,20]% and (80,100]% disorder categories (P < 0.00005; Figure S3c in Additional data file 1). On examination of the second ubiquitination descriptor, a different trend is observed; the percentage of residues predicted as ubiquitination sites increases as the percentage of protein disorder increases, illustrating a strong positive correlation between the two variables (Figure 2e). Predicted ubiquitination sites are under-represented in the [0,20]% disorder category and over-represented in the remaining four disorder classes (P < 2.2 × 10^-16; Figure S4c in Additional data file 1).

As lysine is over-represented in disordered regions [1, 76, 77], we investigated the percentage of residues predicted as ubiquitination sites in relation to the percentage of protein disorder, taking into account lysine residue biases (Figure S5a in Additional data file 1). First, we calculated a correlation coefficient for the percentage of predicted ubiquitination sites and the percentage of lysine composition for the five disorder categories and obtained a strong positive correlation (R = 0.844772). Second, we normalized the number of predicted ubiquitination sites with respect to the number of lysines for each dataset. The trends observed for the percentage of predicted ubiquitination sites normalized for lysine frequency and disorder are similar to those obtained with the percentage of predicted ubiquitin sites and disorder ignoring lysine biases (Figure S5b in Additional data file 1). Comparing the disorder categories with the order categories, the calculations normalized using the lysine frequency result in differences that are smaller in magnitude. For example, comparing the highly disordered proteins with the highly ordered proteins, an increase of 23.5% is observed instead of 42.5%, and comparing the disordered proteins with the ordered proteins, an increase of 4.4% is observed instead of 12.8%.
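
The lysine normalization described above can be sketched as follows. This is an illustrative reading rather than the authors' code: the function names are hypothetical, and interpreting the reported increases as relative differences between two datasets is an assumption.

```python
def sites_per_lysine(predicted_sites: int, lysines: int) -> float:
    """Predicted ubiquitination sites normalized by the number of lysines
    in the same dataset."""
    if lysines <= 0:
        raise ValueError("dataset must contain at least one lysine")
    return predicted_sites / lysines


def percent_increase(value: float, baseline: float) -> float:
    """Relative increase of `value` over `baseline`, in percent
    (assumed interpretation of the increases quoted in the text)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (value - baseline) / baseline
```

Comparing `sites_per_lysine` for a disordered dataset against an ordered one with `percent_increase` removes the part of the difference that is explained purely by the higher lysine content of disordered regions.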

Discussion

This is the first comprehensive and systematic study of gene expression levels, mRNA decay rates, miRNA targeting and ubiquitination in association with transcripts encoding disordered proteins in humans. Using the human proteome and transcriptome, we set out to gain novel insights into the regulation of disordered proteins. We discuss our findings in the following sections.