Transcriptional regulation of protein complexes in yeast

Simonis, Nicolas; van Helden, Jacques; Cohen, George N; Wodak, Shoshana J

doi:10.1186/gb-2004-5-5-r33

Research
Published: 30 April 2004

Transcriptional regulation of protein complexes in yeast

Nicolas Simonis¹,
Jacques van Helden¹,
George N Cohen² &
…
Shoshana J Wodak¹

Genome Biology volume 5, Article number: R33 (2004) Cite this article

1171 Accesses
29 Citations
Metrics details

Abstract

Background

Multiprotein complexes play an essential role in many cellular processes. But our knowledge of the mechanism of their formation, regulation and lifetimes is very limited. We investigated transcriptional regulation of protein complexes in yeast using two approaches. First, known regulons, manually curated or identified by genome-wide screens, were mapped onto the components of multiprotein complexes. The complexes comprised manually curated ones and those characterized by high-throughput analyses. Second, putative regulatory sequence motifs were identified in the upstream regions of the genes involved in individual complexes and regulons were predicted on the basis of these motifs.

Results

Only a very small fraction of the analyzed complexes (5-6%) have subsets of their components mapping onto known regulons. Likewise, regulatory motifs are detected in only about 8-15% of the complexes, and in those, about half of the components are on average part of predicted regulons. In the manually curated complexes, the so-called 'permanent' assemblies have a larger fraction of their components belonging to putative regulons than 'transient' complexes. For the noisier set of complexes identified by high-throughput screens, valuable insights are obtained into the function and regulation of individual genes.

Conclusions

A small fraction of the known multiprotein complexes in yeast seems to have at least a subset of their components co-regulated on the transcriptional level. Preliminary analysis of the regulatory motifs for these components suggests that the corresponding genes are likely to be co-regulated either together or in smaller subgroups, indicating that transcriptionally regulated modules might exist within complexes.

Background

Multiprotein complexes such as the ribosome, spliceosome, cyclosome, proteasome and the nuclear pore complex have an essential role in cellular processes [1–3]. Until recently, information about the building blocks of specific complexes has been rather selective, and the mechanisms underlying the formation of these complexes, and their regulation, lifetimes and degradation remain largely unknown.

One can surmise that the formation of multiprotein complexes might be regulated at different levels, including transcriptional regulation, post-translational modification and degradation. In prokaryotes a significant proportion of the genes that are co-regulated at the transcriptional level code for proteins that interact physically. This proportion is even higher for gene groups whose co-regulation is conserved in different genomes [4]. In some multiprotein complexes in bacteria, the individual components were reported to be expressed 'as needed', in a time-dependent fashion related to their role in the complex [5].

In eukaryotes, mainly limited to yeast, gene-expression profiles have been shown to correlate with protein function and protein-protein interactions [6–8]. More particularly, genes corresponding to components of multiprotein complexes were found to exhibit correlated expression profiles, especially for complexes that form over a wide range of cellular conditions [8]. In contrast, the relationships between gene expression and genome-scale two-hybrid interaction data appear to be more tenuous [6, 7, 9].

Yeast is an ideal model system in which to investigate the relations between protein interactions and gene co-regulation. It is one of the few organisms in which many individual protein complexes have been characterized by biochemical and other methods, with results available in the Comprehensive Yeast Genome Database (CYGD) [10]. In addition, two independent studies recently characterized multiprotein complexes in yeast by a large-scale experimental approach involving tandem affinity purification and MS analysis (TAP [11]) and high-throughput MS protein complex identification (HMS, [12]). Each study identified several hundred complexes, containing on average about eight and eleven polypeptides, respectively. Many of these were shown to be associated with known cellular processes.

Yeast has also served as a model for the analysis of gene expression [13–15] and transcriptional regulation [16, 17]. Information about the target genes of transcription factors has been compiled in specialized databases such as TRANSFAC [18, 19], SCPD [19], YPD [20] and aMAZE [21, 22]. Most recently, the genes bound by 106 yeast transcription factors were identified by a high-throughput approach [16], producing for the first time a global view of the transcriptional regulation network in this organism.

Here we investigate the transcriptional regulation of multiprotein complexes in yeast. In particular we aimed at finding out to what extent components of such complexes are co-regulated. We first determined the overlap between known sets of co-regulated genes in yeast and groups of genes coding for components of individual multiprotein complexes. A set of co-regulated genes is defined here as the group of target genes of the same transcription factor, and is denoted a 'regulon', in agreement with the classical concept of Maas [23]. Two categories of regulons are considered. The manually curated regulons stored in the databases, and the regulons defined by the gene-factor associations identified in the high-throughput analyses mentioned above [16]. The protein complexes examined are those manually curated in databases and the two datasets derived from the recent genome-scale analyses.

We then applied pattern-discovery algorithms [24, 25] to the upstream sequences of genes coding for the proteins involved in each of the complexes in the three datasets under consideration. These algorithms are used to detect sequence patterns shared by some or all of these genes, which are likely to represent binding sites for transcription factors. These patterns take the form of short oligonucleotides (hexamers or pairs of trimers) that occur much more frequently in the upstream regions of these genes than in the corresponding regions across the entire yeast nuclear genome.

We have shown recently that these algorithms have an important advantage of returning predictions with a very small rate of false positives (over-represented patterns in groups of randomly selected genes) when stringent enough statistical criteria are used [26]. Alternative methods based on matrix descriptions [27–31] allow a more refined description of pattern degeneracy, in which a given sequence position need not be strictly conserved. But, unlike the approach used here, they have the inconvenience of nearly always returning a prediction, even for random sequences. This is particularly problematic when analyzing large groups of genes, of which a sizable proportion might not be regulated at the transcriptional level, or at least not by the same transcription factor, as might be the case for many of the protein complexes examined here.

Using the set of patterns detected for each complex, we proceeded to predict the components of the complex that are likely to be co-regulated. This is a difficult task, as the upstream regions of genes often contain multiple binding sites for the same factor or can be regulated by a combination of different factors that bind to distinct sites [32, 33]. In addition, pattern-discovery algorithms generally return a number of strongly overlapping patterns for a given transcription factor, indicating the presence of a partial degeneracy [24, 25]. Therefore, identifying sets of co-regulated genes usually involves assembling the patterns into longer motifs, and searching for upstream regions that score highly against these motifs, an approach that often yields ambiguous results.

Here we use an alternative approach in which a discriminant analysis is performed directly on the detected short patterns and their multiple occurrences [26], thereby avoiding the difficult task of pattern assembly. This analysis is done for all the complexes considered and the results are discussed in terms of our current knowledge of these complexes and their regulation.

Together, the approaches presented here provide valuable insights into the transcriptional regulation of multiprotein complexes in yeast and help in extracting information on function from genome-scale datasets for these complexes.

Results

Correspondence between multiprotein complexes and known regulons

The genes coding for the components of each protein complex in the different datasets are compared to those in the known regulons, with the aim of detecting complex-regulon pairs where the overlap between the components is more extensive than would be expected by chance.

The analyzed datasets of complexes comprised 243 annotated protein complexes from CYGD [10] and 725 complexes identified by the high-throughput studies [11, 12]. The complexes from the latter two studies were taken as defined by their authors, without further grouping [34]. The regulons datasets comprised the 200 annotated and the 106 high-throughput regulons [16].

To determine whether the number of common components for a given complex-regulon pair is above chance level, or statistically significant, we compute the expectation value (E-value) of observing at least that number by chance, and retain only pairs with an E-value below a certain threshold (see Materials and methods).

Correspondence between regulons and annotated protein complexes

Table 1 lists the complex-regulon pairs whose overlap is above chance level (E-value ≤ 0.01), obtained when mapping the annotated complexes onto the annotated (Table 1a) and high-throughput (Table 1b) regulons, respectively. It is striking to see that the 243 annotated complexes and 306 known regulons form a total of only 57 pairs with a statistically significant overlap. Forty of those are with the annotated regulons, and the remaining ones (only 17 in total) are with the high-throughput regulons. Those pairs involve only about 8% of complexes (20 out of 243) and 14% of the regulons (44 out of 306). The overlap between known regulons and annotated complexes is thus on the whole quite limited.

Table 1 Statistically significant associations between annotated complexes and known regulons

Full size table

Relating protein complexes to gene-expression data, Jansen et al. [7] found it useful to distinguish between two major categories of complexes. 'Permanent' complexes are defined as those that are detected under a wide range of different cellular conditions, whereas 'transient' ones are defined as complexes that form under a specific set of conditions. While keeping in mind that this division is probably oversimplified and could sometimes be misleading, we follow these authors in considering it a helpful working hypothesis. The list of complexes in each category was derived from Jansen et al. [7] with some editing. We classified complexes that did not clearly fit either of the first two categories, and some larger assemblies composed of several complexes, as 'other'.

Table 1 reveals that meaningful overlaps between complexes and known regulons occur for both permanent and non-permanent complexes. Associations with the annotated regulons involve fewer complexes of the permanent category than of non-permanent ones (Table 1a). In contrast, the associations with the high-throughput regulons involve more permanent complexes than transient ones (Table 1b), in better agreement with the reported stronger relations of permanent versus transient complexes with mRNA expression profiles [7].

Another interesting observation is that the set of complexes into which regulons map and the extent of overlap between complexes and regulons is also quite different for the annotated and high-throughput regulon datasets. Regulons from both datasets map into complexes such as cytochrome bc₁, nucleosomal protein complex, ribonucleoside diphosphate reductase and fatty-acid synthetase. On the other hand, complexes such as the proteasome, the Cdc28p cyclins and RNA polymerase II are only involved in associations with annotated regulons (Table 1a), whereas the ribosomal subunits or cytochrome c oxidase complexes are only involved in associations with high-throughput regulons (Table 1b).

These and other differences are most likely to be due to the different composition of the regulon repertoires in the two datasets. The annotated dataset contains nearly twice as many regulons as the high-throughput one. But the regulons in the latter dataset are significantly larger, with on average six times more genes than in the annotated regulons (see Materials and methods). It is therefore not too surprising that for associations involving high-throughput regulons, the fraction of the components of a given complex covered by a regulon is in general higher than for annotated regulons. It should at the same time be cautioned that the high-throughput regulons probably contain a fair number of spurious members (false positives) [26].

Zoom-in on the overlaps between regulons and annotated complexes

We see that a complex is often associated with several regulons. This is due in part to the substantial overlap that often exists between the components of individual regulons. The most severe cases occur when different transcription factors are annotated as regulating the exact same set of genes, a situation that is often encountered for small regulons, and probably results from incomplete information or because some transcription factors act in combination or as complexes [35]. We see for example that seven regulons map into the nucleosomal protein complex, six map into the ribonucleoside diphosphate reductase complex, and as many as 10 regulons map into the modular Cdc28p cyclin complexes (Table 1a).

A given regulon also maps, in general, into more than one complex, often onto two, and occasionally onto three. These multiple associations form a patchy network, with several disconnected clusters, which link complexes to regulons. The network graphs built from the associations of the annotated complexes, with annotated and high-throughput regulons, respectively, are illustrated in Additional data file 1 (Figures S1 and S2).

Details of some of these clusters are illustrated in Figures 1 and 2, highlighting the common genes involved. The nucleosomal protein complex (Figure 1a) has seven out of its eight components in common with seven small regulons - Hta1/Hta2, Spt10/Spt21 and Hir1/Hir2/Hir3 - whose genes partially overlap one another. The ribonucleoside diphosphate reductase complex (Figure 1b) has all its four components in common with a total of six partially overlapping regulons. The picture is significantly more complicated for the cyclin-Cdc28p complexes (Figure 1c). As many as 10 regulons map into the 10 components of this complex: the Cln1 and Cln2 genes, which are regulated by as many as five different transcription factors, and two transcription-factor genes, Swi4 and Mcm1, also map into the glucan synthases and pre-replication complex, respectively.

Correspondence between regulons and high-throughput protein complexes

The total number of statistically significant overlaps (E-value ≤ 0.01) is also very low (66 in total) when the known regulons are mapped onto TAP complexes and HMS complexes, even though the number of complexes considered is much larger (725).

The majority of the complex-regulon pairs with meaningful overlap (53) involve annotated regulons, whereas only 13 pairs involve high-throughput regulons. Matches with regulons from either dataset generally involve only a very small subset of the complex components, and there are twice as many matches with complexes from the HMS than from the TAP datasets, in line with the larger size of the former dataset (for a complete list of associations, see Additional data file 2 (Table S2)).

Owing to the appreciable overlap between the components of different complexes within and between the TAP and HMS datasets, the network of associations between these complexes and the regulons is much more intricate than for the annotated complexes. A network graph was built from the larger set of 125 complex-regulon pairs with meaningful overlaps (E-value ≤ 0.1) involving the annotated regulons (Figure 3). This network features seven separate dense clusters of connections (Figure 3a-g). Details of the regulon-complex overlaps in some of these clusters, highlighting the common genes involved, are depicted in Figure 4a-c. The remaining clusters are detailed in Additional data file 1 (Figure S3). In Figure 4h the set of remaining very small clusters, each involving mostly one or two connections, is grouped.

The first cluster (Figure 4a) corresponds chiefly to the overlap between the Rpn4 regulon and 12 rather large complexes (six TAP and six HMS complexes). Nine of the 11 genes of this regulon map onto these complexes. All the complexes contain components of the yeast proteasome, and some other functionally related proteins in variable proportions. Interestingly, six of the nine common genes correspond to proteins from the 19S regulatory subunit, encoding four of the six ATPases in the subunit (Rpt2, Rpt4, Rpt5, Rpt6). A further two genes, PRE6 and PRE2, code, respectively, for alpha and beta subunits of the catalytic domain [36], and another gene (RAD23) encodes a ubiquitin-like protein, which links DNA repair to the ubiquitin/proteasome pathway [37].

The second cluster (Figure 4b) involves four partially overlapping regulons of three genes each, totaling five genes. These genes map into three medium-sized complexes (6-16 genes) and one large complex of 40 genes, with no more than two to three genes mapping into the same complex. Here, too, the majority of the five genes correspond to a biologically active assembly - the ribonucleoside diphosphate reductase complex and associated kinase. The third cluster (Figure 4c) involves genes of the nucleosomal protein complex. A similar analysis can be made for the remaining four clusters (data not shown), and similar observations are made when analyzing the largest clusters in the network graph built from the 46 statistically significant overlapping pairs (E-value ≤ 0.1) involving the TAP and HMS complexes and high-throughput regulons (see Additional data files 1 and 2 (Figures S4, S5 and Table S2d, respectively)).

This detailed analysis shows that although the subset of the components of the multiprotein complexes that corresponds to known regulons is usually quite small, it tends to be composed of proteins with close physical interactions and/or clear functional relations. We also find that the bulk of the overlaps involve genes that map into both permanent complexes such as the proteasome or the nucleosomal-protein complex, as well as into non-permanent ones, such as the ribonucleoside diphosphate reductase and the cyclin-Cdc28p complexes. No clear trends can therefore be identified from these data on the regulation of any one category of complexes in particular.

Prediction of cis-acting regulatory elements in genes of multiprotein complexes

The very limited overlap between complexes and regulons detected above might be biologically meaningful, or might be due to the limited information that is currently available on the nature of protein complexes and regulatory networks in yeast. Given these uncertainties, it seemed of interest to complement the above analyses by an approach in which regulons are directly predicted from the components of protein complexes.

If the components of a given protein complex are co-regulated on the transcriptional level, one would expect to find common regulatory sequence elements, corresponding to transcription factor binding sites, in the upstream regions of the corresponding genes. The problem of identifying regulatory sites is notoriously difficult [33]. To tackle it we applied algorithms for the discovery of oligonucleotides (here, hexanucleotides) [24] and spaced pairs of trinucleotides [25], which occur more frequently in the upstream regions of the genes coding for the components of each complex than in the corresponding regions across the entire yeast nuclear genome. For this approach we considered only complexes with at least five components.

Highly significant patterns are detected for only a small subset of the complexes

Figure 5 plots the number and fraction of the protein complexes in each of the three datasets examined (the annotated, TAP and HMS complexes) for which regulatory-sequence patterns were identified by our prediction method using three different reliability thresholds. Plotted alongside are the corresponding results obtained here for sets of randomly selected genes (used as negative control) and results for known regulons (positive control) obtained in another study [26].

A first observation is that the fraction of complexes for which regulatory patterns are identified with some reliability is quite low. No more than 27-28% of the complexes from either of the three analyzed datasets have at least one pattern with statistical significance Sig ≥ 1 (corresponding to an E-value ≤ 0.1). At this threshold the fraction of complexes with identified patterns is nonetheless about 7-10% higher than for gene groups selected at random. With the more stringent significance threshold (Sig ≥ 2), the fraction of complexes with at least one pattern drops further, but less for the curated (15%) and TAP complexes (13%), than for the HMS complexes (8%).

We recently applied the same algorithms to the dataset of annotated regulons [26]. As the genes belonging to the same regulon are expected to be co-regulated and hence to exhibit common regulatory-sequence patterns, our algorithms should perform well on these genes. This was indeed the case. Patterns with Sig = 1 were identified in as many as 84% of the annotated regulons, as illustrated in Figure 5.

The fraction of the complexes in which regulatory patterns can be reliably detected is thus clearly much smaller, confirming that the components of complexes are on average much less consistently co-regulated than the genes that belong to known regulons.

Assigning components of protein complexes to putative regulons on the basis of predicted patterns

Having shown that highly reliable regulatory patterns can be detected in genes corresponding to at least a fraction of the complexes, we now proceeded to determine, for each complex, which of its components are likely to be co-regulated, and what fraction of the complex they represent. To this end, complexes with at least five component genes, featuring at least one significant pattern (Sig ≥ 1), are selected. A stepwise linear discriminant analysis [38] with a leave-one-out procedure is applied to assign a gene involved in a given complex, either to its original complex or to a control group of randomly selected genes, according to the number of occurrences of the discovered patterns in its upstream region. The assigned group (complex or control) is then compared to the group from which the gene was drawn to evaluate the coverage and positive predictive power (PPP) of the assignment. Coverage is defined as the fraction of the total number of genes in the complex that were reassigned to it by the discriminant procedure. PPP is defined as the fraction of total number of genes assigned to the complex that actually belong to it (see Materials and methods for details).

Figure 6 displays the coverage versus PPP values for a total of 140 individual complexes from the three datasets analyzed here (34 TAP, 75 HMS, and 31 annotated ones). The coverage obtained for these complexes has a mean value of 48%, and a standard deviation of about 25%. The mean PPP is 80%, with a standard deviation of about 10%, and only a single case with perfect assignment (PPP = 100%). There is very little difference between the results obtained for the annotated, TAP, and HMS complexes (see Additional data file 2 (Table S3) for details). It is noteworthy that significantly higher average values for the coverage and PPP (72% and 92% respectively) were obtained by applying the same procedure to the annotated regulons [26].

Putative regulons in the annotated complexes

We determined whether the putative regulons identified by our procedures can provide useful information on the transcriptional regulation of protein complexes. As a first step, we discuss several aspects of the prediction results for patterns and putative regulons obtained for the annotated complexes, summarized in Table 2. This lists the results for all the complexes for which at least one statistically significant (Sig ≥ 1) regulatory pattern has been detected. A complete list of the predicted co-regulated components in each of the complexes considered is given in Additional data file 2 (Table S4).

Table 2 Results of the pattern discovery and discriminant analysis for annotated complexes

Full size table

Table 2 reveals a clear difference between results for the permanent and the non-permanent complexes. Most strikingly, the fraction of the components of a given complex covered by our putative regulons is noticeably higher for most permanent complexes (0.7-1.0) than for the non-permanent ones (0.06-0.6). The number of significant regulatory patterns and the significance value of the 'best' pattern are also generally higher in theses complexes. Among the complexes with the highest coverage by putative regulons and a large number of statistically significant patterns we find the proteasome, the large and small subunits of the cytoplasmic ribosome, three complexes of the respiratory chain, the translation elongation complex, as well as the nucleosomal protein and cyclin Cdc28p complexes. To illustrate the information provided by our approach, we will discuss in detail our findings for the nucleosomal protein complex and the replication fork complexes.

Nucleosomal protein complex

This complex has all of its eight components predicted to be part of a regulon, with a large number (20) of significant patterns. Details of the patterns discovered, of which the most statistically significant are spaced dyads, and their locations in the upstream regions of the corresponding genes are shown in the feature map (Figure 7). All but one of these dyads are mutually overlapping, and can be aligned to form the larger motif cGCGAan{5}caGAACg, where upper-case letters denote the most conserved residues, which seem to be the 'core' of the binding site, and the number in brackets is the length of the spacer in terms of the number of intervening nucleotides. The feature map shows that each upstream sequence contains at least two occurrences of this 'core', with some differences in the bases flanking this core. Although several regulons - Hta1/Hta2, Hir2/Hir3/Hir4 and Spt10/Spt21 - are known (and were found here) to map into this complex, covering a total of seven out of the eight components of the complex (Figures 1, 2), our findings represent the first instance where a regulatory motif is proposed for all the members of the nucleosomal complex.

Replication fork complexes

The replication fork complex is an assembly of proteins involved in DNA replication (Table 2). It is encoded by a total of 30 genes, which can be subdivided into several smaller complexes such as the DNA polymerase δ, DNA polymerase ε, DNA α1 primase and replication factor C complexes. Analysis of the entire assembly detected 12 patterns with a maximum significance of 13.3, corresponding to an E-value of 2 × 10^-13. The discriminant analysis carried out on the basis of these patterns allowed us to assign about half (17) of the 30 components of this assembly to putative regulons (Table 2).

Table 3 lists the probabilities for individual components to be assigned to the complex by the discriminant analysis. It reveals a striking observation: the predicted co-regulated genes correspond almost perfectly to seven out of the 14 individual complexes or entities that make up the assembly. The 17 genes that belong to the putative regulons include three of the four components of the DNA polymerase α1 primase complex, all the components of the DNA δ and ε complexes, the replication factor A and topoisomerase complexes, as well as the proliferating cell nuclear antigen (PCNA) and exonucleases. Furthermore, the majority of these genes were assigned to the replication fork assembly with high probability (0.8-0.99).

Table 3 Details of the discriminant analysis results for the replication fork complexes

Full size table

Interestingly, Jansen et al. [7] reported a poor correlation with expression data for the replication complex, a large complex that partially overlaps the replication fork assembly discussed above. They showed, however, that a much better correlation with expression profiles could be obtained for subgroups of components within the complex. Two such subgroups were the DNA polymerase δ and DNA polymerase ε complexes. These are also part of the co-regulated subgroup that we identify in the replication fork assembly, except that our observations suggest that their transcriptional regulation is coupled to that of 11 other genes (the remaining genes with probabilities > 0.5 in Table 3) also involved in DNA replication, and that the transcriptional regulation of all 17 genes is controlled by the same factor, or set of factors.

Our findings therefore agree with the conclusions drawn from relationships between complexes and mRNA expression profiles, namely that the so-called permanent complexes display a more marked trend to be co-regulated en bloc than non-permanent ones, but that non-permanent complexes often contain subgroups of co-regulated components [7]. Our findings furthermore suggest that subsets of the components of a given multiprotein assembly, whose genes display distinct expression patterns, can be under common transcriptional control. This can happen when the same transcription factor requires the presence of different associated factors when binding to each group of co-regulated components, thereby producing different expression profiles for the individual gene groups (see, for example [39]).

Putative regulons in the high-throughput complexes

Interpretation of the predicted regulons for the TAP and HMS complexes is more complicated. Owing to the appreciable overlap between the complexes, significant overlap is also observed between the predicted sets of regulatory patterns and between the corresponding groups of putative co-regulated components in different complexes. These overlaps create a dense network of connections between the complexes. The bulk of the network comprises two large separate clusters of interconnected complexes and a few very small clusters. The complete list of predicted co-regulated components for each complex, and the network graph can be found in Additional data file 2 (Table S4) and Additional data file 1 (Figure S6).

One of the largest clusters comprises a group of nine complexes (four HMS and five TAP) that involve the proteasome genes. The subsets of putative co-regulated genes in each complex and the pattern of overlap between them are illustrated in Figure 8. Remarkably, the vast majority of the genes predicted as being co-regulated in all the nine complexes are annotated as belonging to the proteasome, whereas the majority of remaining genes, not classified with putative regulons, are not annotated as proteasome genes. For most complexes, the fraction of genes assigned to regulons is furthermore quite high (65% on average). The most statistically significant regulatory patterns shared by the genes in the corresponding complexes (Table 4) display a substantial degree of overlap and can be assembled into a larger pattern (TTTGCCACC/GGTGGCAAA). This pattern corresponds to the binding site of the Rpn4 transcription factor, known to be involved in the regulation of proteasome genes [40]. It is present in the upstream regions of nearly all the proteasome genes, with the exception of RPN10, RPN13, PRE5, PUP1, RPN8 and NAS6. But the latter gene group exhibits patterns differing by only one nucleotide from the Rpn4-binding sequence, and might hence still be regulated by this factor or a closely related one. Thus, quite strikingly, the overlap between the complexes and putative regulons is much more extensive here than in the proteasome-related cluster shown in Figures 3 and 4a, which links TAP and HMS complexes to annotated regulons.

Table 4 Regulatory patterns discovered in the protein complexes of the proteasome system

Full size table

Interestingly, the TAP126 and TAP151 complexes, which are part of the proteasome cluster in Figure 8, stand out as the two complexes for which the predicted regulons are less consistent with the functional annotations. We see, however, that these regulons include at most a third of the components of the complexes, and that the PPPs of regulon assignment for these complexes are significantly below average (62% and 69%, respectively), suggesting that the corresponding predictions might be less reliable. On the other hand, two genes of unknown function (YHR033W and YLR199C) are predicted to be co-regulated with other proteasome genes in the HMS233 complex, for which the coverage and PPP values are high (87% and 88%, respectively). The possibility that these orphan genes might be co-regulated with other proteasome components is therefore a reasonable hypothesis that can be tested experimentally.

These various observations indicate that combining pattern discovery and discriminant analysis to predict regulons should be a promising avenue for the interpretation of high-throughput datasets on protein complexes, in terms of biological function and transcriptional regulation.

Discussion

We used two approaches to investigate transcriptional regulation of multiprotein complexes in yeast. First, sets of genes representing known targets of yeast transcription factors, the regulons, were mapped onto both manually curated protein complexes and those identified by genome-wide pull-down experiments. Second, a predictive approach was applied to the same set of complexes. In this approach, string-based techniques for the discovery of regulatory patterns were combined with a discriminant analysis in order to assign genes to regulons on the basis of the predicted patterns.

The straightforward mapping approach revealed that only very few of the complexes had a good fraction of their components belonging to known regulons, whereas for the majority of the complexes only a very partial overlap with regulons was detected. Following previous work [7], we subdivided the complexes into two categories: permanent complexes that exist under a wide range of cellular conditions; and transient complexes that form under particular conditions only. Overall, we found little difference in the overlap with regulons between complexes of the two categories, except maybe when matching the annotated complexes against the high-throughput regulons dataset. In the latter case, a more extensive overlap with regulons was observed for the permanent complexes than for the transient ones (Table 1b). But this observation should be considered with caution, taking into account the sizable level of noise in the corresponding regulon dataset.

Considering that the above results mainly reflect the limited information currently available on transcription factor binding sites in yeast, we applied our predictive approach. The pattern-discovery procedures identified statistically significant regulatory patterns in only a small fraction of the complexes (8-15%). Putative regulons were identified in this small fraction on the basis of the patterns discovered, in agreement with our analysis of the overlaps between complexes and known regulons. The identified regulons included on average nearly half (47%) of the components in the annotated complexes, with, nonetheless, a large spread in component coverage, ranging between 6% and 100% of the proteins/genes of a complex.

More interestingly, in annotated complexes in which regulatory patterns were predicted, the fraction of the components belonging to putative regulons, the number of regulatory patterns, and the statistical significance value for the 'best' pattern in each complex were generally substantially higher for the permanent complexes than for complexes from other categories. This result is in good agreement with previous reports on the better correlation between the mRNA expression profiles of genes coding for the components of permanent complexes than for their transient counterparts [7]. It suggests, furthermore, that complexes such as the proteasome, the respiratory-chain complexes, the cytoplasmic ribosome or the nucleosomal protein complex, which are present under a range of different cellular conditions, are actively regulated at the transcriptional level.

Detailed analysis of the predicted regulons in some of the non-permanent complexes, such as the replication fork complexes (Table 3), also suggests that even in cases in which only a fraction of the components belong to regulons, these components are likely to be co-regulated, either together or in smaller subgroups, thereby revealing the existence of transcriptionally regulated modules within complexes. Interestingly, in the case of the replication fork complex, the detected modules contain groups of genes reported to display distinct mRNA expression patterns [7], an indication that gene groups with different expression patterns might nevertheless be under common transcriptional control. These findings are clearly preliminary and should be confirmed by a systematic comparison with mRNA expression data for genes involved in all the complexes for which putative regulons were identified. In particular, this comparison should consider separately the mRNA expression profiles measured under different experimental conditions rather than combine them as was done previously [7], as we might expect the transcriptional regulation of certain complexes to vary with these conditions.

It is useful at this point to discuss the reliability of our prediction procedure. Recent benchmarks performed on the annotated regulons in Saccharomyces cerevisiae [26], showed that our combined approach produces highly specific regulon assignments, with good coverage. On average, 91% of the genes assigned to a regulon were actually part of it, indicating that our approach produces a very low rate of false positives, and hence that it would rarely assign to a regulon genes that do not belong to it. The coverage, or the fraction of the genes in a regulon that could be reassigned to it, was lower (73% on average), suggesting that the regulons detected by our approach might not include all their actual members, a much less serious shortcoming than making false-positive predictions.

More important still, the same benchmarks showed that our approach maintained on average a similar level of coverage and reached PPP values as high as 84% when applied to gene groups in which known regulons were mixed with an equal number of random genes, a situation more closely resembling that of the multiprotein complexes analyzed here. We therefore believe that our predictive methods yield quite reliable regulon predictions, particularly for larger complexes in which only about 50% or more of the genes may be co-regulated.

Our methods do, however, have obvious limitations. One is that they would miss altogether sites located outside the 800 base-pair (bp) limit of the analyzed upstream sequences. Another potential shortcoming is their limited capacity to detect regulatory sites that have a high degree of sequence variability, something that matrix-based methods could do better [27–31]. It could therefore be possible that we tend to find fewer putative regulons in the so-called transient complexes than in their permanent counterparts because the regulatory mechanisms of these complexes are more specific, involving a larger variety of transcription factors or a more diverse pattern of interactions.

On the other hand, a clear advantage of our pattern-discovery procedure is its low rate of false-positive predictions. As a result, the most reliable patterns discovered here should be a good starting point for deriving the full regulatory motif. This requires that the discovered patterns be extended and their sequence degeneracy characterized [41, 42], a difficult task that is often considered an obligatory step in identifying regulons. Here, this difficulty was circumvented by our cross-validated discriminant analysis, which we applied directly to the small sequence motifs. Combining the two approaches in the future should lead to improvements.

Lastly, we show that useful information could also be obtained by building the network graph representing the multiple links between protein complexes and predicted regulons, and analyzing the common genes participating in these links (Additional data file 1 (Figure S6)). This was illustrated for one of the large clusters of this graph, representing the proteasome-related complexes (Figure 7), but was also observed for other regulon-complex relationships discovered in this study (data not shown). Such networks are complementary to the recently described regulatory network of gene-transcription factor interactions [16, 43] and should provide insights into the functional relationships between complexes.

The various observations made here are consistent with the results of a related study [44], in which pattern-discovery methods were applied to sets of yeast genes and those sets with common patterns in their upstream regions were scored against interacting proteins from the TAP and HMS protein complexes or closely linked proteins in the metabolic network.

Materials and methods

Data on multiprotein complexes

Three different datasets on protein complexes from the yeast S. cerevisiae were analyzed. One comprises a set of 243 protein complexes manually curated from the literature and retrieved from the complexes catalog in CYGD [10, 45]. Those are referred to as annotated complexes. This set includes complexes that are known to form a single physical entity under some experimental conditions, as well as larger assemblies composed of several complexes whose formation is thought to be interdependent. Components of complexes encoded by mitochondrial genes were not considered, as the analysis of their regulatory motifs requires a different background model than for the nuclear genes. The complete list of annotated complexes used in this study is given in Additional data file 2 (Table S1).

The other two datasets comprise a total of 725 protein complexes identified respectively in the tandem affinity purification and MS analysis (TAP, 232 complexes) [11], and in the high-throughput MS protein complex identification (HMS, 493 complexes) [12]. These analyses probed only a subset of the proteins comprising many of the phosphatases, kinases and proteins involved in DNA repair. This subset (which corresponds to the so-called bait proteins [11, 12]) represents about 25% of all yeast proteins in the TAP study, and about 10% in the HMS study. The composition of individual complexes from these studies was downloaded from the websites indicated in the original papers [46, 47].

Data on co-regulated genes

Information on co-regulated genes in S. cerevisiae was obtained from two types of data on gene-transcription factor associations. One corresponds to 1,406 gene-factor associations manually curated from the literature, which were obtained from the TRANSFAC [18] and aMAZE [21, 22] databases, from the list compiled by Young and colleagues, which excludes gene-factor associations deduced from sequence analysis [16, 48], and from additional literature searches.

The manually curated gene-factor associations were grouped into 200 regulons, with on average seven genes per regulon, and are referred to as annotated regulons in this study. A regulon is defined here as the set of genes that bind the same transcription factor, and is denoted by the name of the transcription factor polypeptide. We considered individual polypeptides as distinct transcription factors, ignoring the fact that the actual active species might in some cases be a complex of several polypeptides (for example Hap2, Hap3, Hap4 [49]).

The second set of gene-factor associations is data obtained by Lee et al. [16] using a high-throughput ChIP approach under specific cellular and culture conditions; it was downloaded from the authors' website [50]. Here we consider only the associations with the highest statistical significance (P-value ≤ 10^-3), as defined by Lee et al. [16]. Those number 4,433, and correspond to 106 distinct regulons, with 41.8 genes on average. Only 185 of these gene-factor associations are the same as those of the manually curated dataset.

Correspondence between protein complexes and regulons

To evaluate the correspondence between protein complexes and regulons, their gene compositions are compared and a statistical significance criterion (E-value) is computed using the hypergeometric formula implemented in the software Compare-Classes. The complexes and the regulons are considered as independent samples of genes from the complete yeast genome (n = 6,450). For a complex containing a genes, and a regulon b genes, the probability of finding exactly c common genes between them is

,

where

is the binomial coefficient. The probability of observing at least c genes in common by chance is given by

.

To correct for multi-testing (a given complex is compared to several hundred regulons), P is converted to an E-value (expected value), E-value = R * P(X ≥ c), where R is the total number of regulons.

Pattern discovery in upstream sequences

Upstream sequences are analyzed using the regulatory sequence analysis tools (RSAT) [51, 52]. For each complex, the corresponding upstream sequences are retrieved over, at most, 800 bp from the start codon, clipping the sequence when necessary to avoid including upstream open reading frames (ORFs). Large redundant fragments, which may result from very recent duplications or from the presence of two genes transcribed in opposite directions, are discarded using Mkvtree and Vmatch [53].

Two pattern-discovery algorithms are applied to each sequence set to detect oligonucleotides [24], and dyads (spaced pairs of short oligonucleotides) [25], respectively, which occur more frequently (are over-represented) in these regions in comparison to upstream regions of all the genes from the S. cerevisiae nuclear genome. The degree of over-representation of a given pattern is determined by a statistical significance criterion Sig = -log(E-value). The larger the co-regulated gene group, the more reliable the prediction becomes. From previous experience we chose to analyze only gene groups with at least five members.

Discriminant analysis

Linear discriminant analysis [38] is used to classify the genes involved in a given set (protein complex or regulon) according to the type and number of statistically significant oligonucleotide and dyad sequence patterns that occur in their upstream regions.

A detailed description of the approach is given elsewhere [26]. In summary, two gene groups are defined. Group 1 comprises the g genes of a given protein complex, in which p over-represented patterns are identified. Group 2 is a control group of 3g genes, selected at random from the yeast nuclear genome. Upstream regions of the genes in both groups are analyzed to count the number of occurrences for all the identified p patterns, taken as variables. A linear discriminant function is then built, which optimally separates the genes from groups 1 and 2 into their respective groups in the p-dimensional space of the considered variables. To avoid over-fitting, the most discriminant variables are identified in a stepwise approach.

To assign individual genes from a given complex to either group, a leave-one-out procedure is carried out, whereby the genes are removed from the complex one at a time, a discriminant function is built each time with a different gene removed and used to assign the removed gene to groups 1 or 2.

To minimize fluctuation in the results, the entire procedure is repeated 100 times for each complex, using different random selections of genes for group 2, and the probability that a given gene is part of the complex is computed as the mean of the posterior probabilities evaluated in all the trials. Genes with mean posterior probability > 0.5 are assigned to the group of putative co-regulated genes, or regulons. The discriminant analysis was performed with the R package [54].

To assess the performance of the discriminant analysis procedure the group (complex or control) to which a given gene has been assigned is compared to the group from which it was originally drawn and two quantities are evaluated. One is the coverage, defined as TP/(TP + FN), where TP is the number of genes assigned to a given complex that were originally part of it, and FN is the number of genes that are part of the original complex, but which were (incorrectly) assigned to the control group. The other is the positive predictive power (PPP), defined as TP/(TP + FP), with FP being the number of genes that were (incorrectly) assigned to the complex.

Additional data files

The following additional data files are available with the complete version of this article, online: a PDF file (Additional data file 1) containing supplementary Figures S1 to S6 and their legends, and a Word file (Additional data file 2) containing supplementary Tables S1 to S4 and their legends.

References

Neubauer G, King A, Rappsilber J, Calvio C, Watson M, Ajuh P, Sleeman J, Lamond A, Mann M: Mass spectrometry and EST-database searching allows characterization of the multi-protein spliceosome complex. Nat Genet. 1998, 20: 46-50. 10.1038/1700.
Article CAS PubMed Google Scholar
Zachariae W, Shin TH, Galova M, Obermaier B, Nasmyth K: Identification of subunits of the anaphase-promoting complex of Saccharomyces cerevisiae. Science. 1996, 274: 1201-1204. 10.1126/science.274.5290.1201.
Article CAS PubMed Google Scholar
Rout MP, Aitchison JD, Suprapto A, Hjertaas K, Zhao Y, Chait BT: The yeast nuclear pore complex: composition, architecture, and transport mechanism. J Cell Biol. 2000, 148: 635-651. 10.1083/jcb.148.4.635.
Article PubMed Central CAS PubMed Google Scholar
Huynen M, Snel B, Lathe W, Bork P: Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 2000, 10: 1204-1210. 10.1101/gr.10.8.1204.
Article PubMed Central CAS PubMed Google Scholar
Kalir S, McClure J, Pabbaraju K, Southward C, Ronen M, Leibler S, Surette M, Alon U: Ordering genes in a flagella pathway by analysis of expression kinetics from living bacteria. Science. 2001, 292: 2080-2083. 10.1126/science.1058758.
Article CAS PubMed Google Scholar
Ge H, Liu Z, Church GM, Vidal M: Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet. 2001, 29: 482-486. 10.1038/ng776.
Article CAS PubMed Google Scholar
Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res. 2002, 12: 37-46. 10.1101/gr.205602.
Article PubMed Central CAS PubMed Google Scholar
Teichmann SA, Babu MM: Conservation of gene co-regulation in prokaryotes and eukaryotes. Trends Biotechnol. 2002, 20: 407-410. 10.1016/S0167-7799(02)02032-2.
Article CAS PubMed Google Scholar
Gerstein M, Jansen R: The current excitement in bioinformatics - analysis of whole-genome expression data: how does it relate to protein structure and function?. Curr Opin Struct Biol. 2000, 10: 574-584. 10.1016/S0959-440X(00)00134-2.
Article CAS PubMed Google Scholar
Mewes HW, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 2002, 30: 31-34. 10.1093/nar/30.1.31.
Article PubMed Central CAS PubMed Google Scholar
Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002, 415: 141-147. 10.1038/415141a.
Article CAS PubMed Google Scholar
Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002, 415: 180-183. 10.1038/415180a.
Article CAS PubMed Google Scholar
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9: 3273-3297.
Article PubMed Central CAS PubMed Google Scholar
Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, et al: Functional discovery via a compendium of expression profiles. Cell. 2000, 102: 109-126. 10.1016/S0092-8674(00)00015-5.
Article CAS PubMed Google Scholar
Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000, 11: 4241-4257.
Article PubMed Central CAS PubMed Google Scholar
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002, 298: 799-804. 10.1126/science.1075090.
Article CAS PubMed Google Scholar
Lee TI, Young RA: Transcription of eukaryotic protein-coding genes. Annu Rev Genet. 2000, 34: 77-137. 10.1146/annurev.genet.34.1.77.
Article CAS PubMed Google Scholar
Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruss M, Reuter I, Schacherer F: TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 2000, 28: 316-319. 10.1093/nar/28.1.316.
Article PubMed Central CAS PubMed Google Scholar
Zhu J, Zhang MQ: SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics. 1999, 15: 607-611. 10.1093/bioinformatics/15.7.607.
Article CAS PubMed Google Scholar
Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P, Robertson LS, Skrzypek MS, Braun BR, Hopkins KL, Kondu P, et al: YPD, PombePD and WormPD: model organism volumes of the BioKnowledge library, an integrated resource for protein information. Nucleic Acids Res. 2001, 29: 75-79. 10.1093/nar/29.1.75.
Article PubMed Central CAS PubMed Google Scholar
van Helden J, Naim A, Mancuso R, Eldridge M, Wernisch L, Gilbert D, Wodak SJ: Representing and analysing molecular and cellular function using the computer. Biol Chem. 2000, 381: 921-935.
CAS PubMed Google Scholar
van Helden J, Naim A, Lemer C, Mancuso R, Eldridge M, Wodak S: From molecular activities and processes to biological function. Brief Bioinform. 2001, 2: 81-93.
Article CAS PubMed Google Scholar
Maas WK: Studies on the mechanism of repression of arginine biosynthesis in Escherichia coli. II. Dominance of repressibility in diploids. J Mol Biol. 1964, 8: 365-370.
Article CAS PubMed Google Scholar
van Helden J, Andre B, Collado-Vides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol. 1998, 281: 827-842. 10.1006/jmbi.1998.1947.
Article CAS PubMed Google Scholar
van Helden J, Rios AF, Collado-Vides J: Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 2000, 28: 1808-1818. 10.1093/nar/28.8.1808.
Article CAS PubMed Google Scholar
Simonis N, Wodak SJ, Cohen GN, Van Helden J: Combining pattern discovery and discriminant analysis to predict gene co-regulation. Bioinformatics. 2004, DOI: 10.1093/bioinformatics/bth252
Google Scholar
Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994, 2: 28-36.
CAS PubMed Google Scholar
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993, 262: 208-214.
Article CAS PubMed Google Scholar
Neuwald AF, Liu JS, Lawrence CE: Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci. 1995, 4: 1618-1632.
Article PubMed Central CAS PubMed Google Scholar
Roth FP, Hughes JD, Estep PW, Church GM: Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol. 1998, 16: 939-945.
Article CAS PubMed Google Scholar
Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouze P, Moreau Y: A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics. 2001, 17: 1113-1122. 10.1093/bioinformatics/17.12.1113.
Article CAS PubMed Google Scholar
Jones EW, Pringle JR, Broach JR: The Molecular and Cellular Biology of the Yeast Saccharomyces: Gene Expression. 1992, New York: Cold Spring Harbor Laboratory Press
Google Scholar
Bucher P: Regulatory elements and expression profiles. Curr Opin Struct Biol. 1999, 9: 400-407. 10.1016/S0959-440X(99)80054-2.
Article CAS PubMed Google Scholar
Krause R, von Mering C, Bork P: A comprehensive set of protein complexes in yeast: mining large scale protein-protein interaction screens. Bioinformatics. 2003, 19: 1901-1908. 10.1093/bioinformatics/btg344.
Article CAS PubMed Google Scholar
Manke T, Bringas R, Vingron M: Correlating protein-DNA and protein-protein interaction networks. J Mol Biol. 2003, 333: 75-85. 10.1016/j.jmb.2003.08.004.
Article CAS PubMed Google Scholar
Coux O, Tanaka K, Goldberg AL: Structure and functions of the 20S and 26S proteasomes. Annu Rev Biochem. 1996, 65: 801-847. 10.1146/annurev.bi.65.070196.004101.
Article CAS PubMed Google Scholar
Schauber C, Chen L, Tongaonkar P, Vega I, Lambertson D, Potts W, Madura K: Rad23 links DNA repair to the ubiquitin/proteasome pathway. Nature. 1998, 391: 715-718. 10.1038/35661.
Article CAS PubMed Google Scholar
Huberty CJ: Applied Discriminant Analysis. 1994, New York: Wiley
Google Scholar
Lieb JD, Liu X, Botstein D, Brown PO: Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat Genet. 2001, 28: 327-334. 10.1038/ng569.
Article CAS PubMed Google Scholar
Mannhaupt G, Schnall R, Karpov V, Vetter I, Feldmann H: Rpn4p acts as a transcription factor by binding to PACE, a nonamer box found upstream of 26S proteasomal and other genes in yeast. FEBS Lett . 1999, 450: 27-34. 10.1016/S0014-5793(99)00467-6.
Article CAS PubMed Google Scholar
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003, 423: 241-254. 10.1038/nature01644.
Article CAS PubMed Google Scholar
Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M: Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science. 2003, 301: 71-76. 10.1126/science.1084337.
Article CAS PubMed Google Scholar
Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U: Network motifs: simple building blocks of complex networks. Science. 2002, 298: 824-827. 10.1126/science.298.5594.824.
Article CAS PubMed Google Scholar
Ettwiller LM, Rung J, Birney E: Discovering novel cis-regulatory motifs using functional networks. Genome Res. 2003, 13: 883-895. 10.1101/gr.866403.
Article PubMed Central CAS PubMed Google Scholar
Comprehensive Yeast Genome Database: yeast cellular substructures, protein complexes. [http://mips.gsf.de/proj/yeast/catalogues/complexes]
MDS Proteomics: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. [http://www.mdsproteomics.com/yeast]
YEAST protein complex database. [http://yeast.cellzome.com]
Previous evidence. [http://staffa.wi.mit.edu/cgi-bin/young_public/navframe.cgi?s=17&f=evidence]
Gancedo JM: Yeast carbon catabolite repression. Microbiol Mol Biol Rev. 1998, 62: 334-361.
PubMed Central CAS PubMed Google Scholar
Transcriptional Regulatory Networks. [http://web.wi.mit.edu/young/regulator_network/]
van Helden J: Regulatory Sequence Analysis Tools. Nucleic Acids Res. 2003, 31: 3593-3596. 10.1093/nar/gkg567.
Article PubMed Central CAS PubMed Google Scholar
Regulatory Sequence Analysis Tools (RSAT). [http://rsat.ulb.ac.be/rsat/]
Kurtz S, Choudhuri JV, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R: REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res. 2001, 29: 4633-4642. 10.1093/nar/29.22.4633.
Article PubMed Central CAS PubMed Google Scholar
The R Project for Statistical Computing. [http://www.r-project.org]

Download references

Acknowledgements

We thank Michael Holger and Edgar Wingender for access to TRANSFAC, and Jean Richelle and Raphael Leplae for help with the computer systems. N.S. is grateful to Didier Croes for Java lessons and support. This work was funded by the European Communities, grant QLRI-199-01333 to the CYGD project, by the Action de Recherches Concertées de la Communauté Française de Belgique, and by the Government of the Brussels Region.

Author information

Authors and Affiliations

Service de Conformation des Macromolécules Biologiques, Centre de Biologie Structurale et Bioinformatique, CP 263, Université Libre de Bruxelles, Bld du Triomphe, B-1050, Bruxelles, Belgium
Nicolas Simonis, Jacques van Helden & Shoshana J Wodak
Institut Pasteur, Unité d'Expression des Gènes Eucaryotes, Institut Pasteur, rue du Docteur Roux, 75724, Paris Cedex 15, France
George N Cohen

Authors

Nicolas Simonis
View author publications
You can also search for this author in PubMed Google Scholar
Jacques van Helden
View author publications
You can also search for this author in PubMed Google Scholar
George N Cohen
View author publications
You can also search for this author in PubMed Google Scholar
Shoshana J Wodak
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Shoshana J Wodak.

Electronic supplementary material

Additional data file 1: A PDF file containing supplementary Figures S1 to S6 and their legends (PDF 1 MB)

Additional data file 2: A Word file containing supplementary Tables S1 to S4 and their legends (DOC 1 MB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Simonis, N., van Helden, J., Cohen, G.N. et al. Transcriptional regulation of protein complexes in yeast. Genome Biol 5, R33 (2004). https://doi.org/10.1186/gb-2004-5-5-r33

Download citation

Received: 26 November 2003
Revised: 30 March 2004
Accepted: 06 April 2004
Published: 30 April 2004
DOI: https://doi.org/10.1186/gb-2004-5-5-r33

Transcriptional regulation of protein complexes in yeast

Abstract

Background

Results

Conclusions

Background

Results

Correspondence between multiprotein complexes and known regulons

Correspondence between regulons and annotated protein complexes

Zoom-in on the overlaps between regulons and annotated complexes

Correspondence between regulons and high-throughput protein complexes

Prediction of cis-acting regulatory elements in genes of multiprotein complexes

Highly significant patterns are detected for only a small subset of the complexes

Assigning components of protein complexes to putative regulons on the basis of predicted patterns

Putative regulons in the annotated complexes

Nucleosomal protein complex

Replication fork complexes

Putative regulons in the high-throughput complexes

Discussion

Materials and methods

Data on multiprotein complexes

Data on co-regulated genes

Correspondence between protein complexes and regulons

Pattern discovery in upstream sequences

Discriminant analysis

Additional data files

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Genome Biology

Contact us