Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering
© Gasch and Eisen, licensee BioMed Central Ltd 2002
Received: 1 August 2002
Accepted: 11 September 2002
Published: 10 October 2002
Organisms simplify the orchestration of gene expression by coregulating genes whose products function together in the cell. Many proteins serve different roles depending on the demands of the organism, and therefore the corresponding genes are often coexpressed with different groups of genes under different situations. This poses a challenge in analyzing whole-genome expression data, because many genes will be similarly expressed to multiple, distinct groups of genes. Because most commonly used analytical methods cannot appropriately represent these relationships, the connections between conditionally coregulated genes are often missed.
We used a heuristically modified version of fuzzy k-means clustering to identify overlapping clusters of yeast genes based on published gene-expression data following the response of yeast cells to environmental changes. We have validated the method by identifying groups of functionally related and coregulated genes, and in the process we have uncovered new correlations between yeast genes and between the experimental conditions based on similarities in gene-expression patterns. To investigate the regulation of gene expression, we correlated the clusters with known transcription factor binding sites present in the genes' promoters. These results give insights into the mechanism of the regulation of gene expression in yeast cells responding to environmental changes.
Fuzzy k-means clustering is a useful analytical tool for extracting biological insights from gene-expression data. Our analysis presented here suggests that a prevalent theme in the regulation of yeast gene expression is the condition-specific coregulation of overlapping sets of genes.
The yeast Saccharomyces cerevisiae evolved in a niche in which the availability of nutrients and the conditions of growth vary constantly, and it possesses sophisticated mechanisms to choreograph the expression of its approximately 6,000 genes in order to thrive - or at least survive - in a wide range of environmental conditions. These responses are governed by a complex, condition-specific regulatory system that transduces information through the cell to the nucleus, where gene expression is adjusted accordingly. Many of the individual components of this regulatory system function under particular conditions and govern the expression of overlapping sets of gene targets, allowing a given gene to be coexpressed with different gene groups in response to different conditions (Figure 1a). As a consequence, the targets of each regulatory system often display similar expression patterns in response to one set of conditions but divergent patterns under other situations (Figure 1b). For example, the known targets of the oxidative stress-responsive transcription factor Yap1p are coordinately induced in response to conditions that inflict oxidative damage, but these genes are divergently expressed in response to other environmental changes (Figure 1c) . Similarly, the known targets of other transcription factors in yeast (including Aft1p, Zap1p, Pho4p, Hac1p, Hsf1p, and others) are similarly expressed only in response to certain environments [2,3,4,5,6].
Fuzzy k-means clustering  facilitates the identification of overlapping groups of objects by allowing the objects to belong to more than one group. The essential difference between fuzzy k-means clustering and standard k-means clustering is the partitioning of genes into each group (Figure 2b). Rather than the hard partitioning of standard k-means clustering, where genes belong to only a single cluster, fuzzy k-means clustering considers each gene to be a member of every cluster, with a variable degree of 'membership'. Each gene has a total membership of 1.0 that is apportioned to clusters on the basis of the similarity between the gene's expression pattern and that of each cluster centroid. Genes whose expression patterns are very similar to a given centroid will be assigned a high membership in that cluster, whereas genes that bear little similarity to the centroid will have a low membership. Importantly, genes can be assigned significant memberships to more than one cluster, thus revealing genes whose expression is similar to multiple, distinct groups of genes.
We implemented a heuristic variant of fuzzy k-means clustering that incorporated principal component analysis (PCA) and hierarchical clustering to analyze published yeast genomic expression data that followed the response of cells to different environments. The method successfully identified clusters of functionally related genes and more comprehensive groups of known transcription factor targets in yeast. In the process of this analysis, we identified previously unrecognized similarities in the expression of yeast genes and uncovered correlations between the environmental conditions. We explored the regulation of gene expression by correlating the identified clusters with known regulatory elements present in the genes' promoters. These details implicate mechanisms that yeast cells use to orchestrate genomic expression programs in response to variable conditions.
Fuzzy k-means clustering overview
We implemented a version of the fuzzy k-means algorithm, based on a description by Gath and Geva , in a C++ program called FuzzyK (available at ; see Materials and Methods for complete details). We altered the algorithm in two fundamental ways: first, we performed three successive cycles of fuzzy k-means clustering, with the second and third rounds of clustering performed on subsets of the data. Second, because the random initialization commonly used in k-means clustering can have a profound impact on the results , we instead chose to initialize each clustering cycle by seeding prototype centroids with the eigen vectors identified by PCA of the respective dataset (see below). Here we present an overview of the algorithm, followed by a discussion of the parameter optimization.
After this initial round of fuzzy clustering, duplicate centroids (pairs whose Pearson correlation is greater than 0.9) are averaged, and genes with a greater than 0.7 correlation to any of the identified centroids are removed from the dataset (see Materials and methods). The fuzzy k-means clustering steps described above are repeated on this smaller dataset to identify patterns missed in the first clustering cycle, and the new centroids are added to the set identified in the first round. The process of averaging replicated centroids and selecting a data subset is repeated, and a third cycle of clustering is performed on the subset of genes with a correlation of less than 0.7 to any of the existing centroids. The newly identified centroids are combined with the previous sets, and replicate centroids are averaged.
In the final step of the program, the membership of each gene to each centroid is calculated. Thus, the output of the algorithm is twofold: the method presents a list of the unique centroids identified in the fuzzy clustering cycles along with a matrix representing the final membership scores for each gene to each centroid. In this representation, each gene can be related to all the identified clusters through its membership value, allowing genes that belong significantly to multiple clusters to be realized. As a consequence, each cluster consists of a continuous list of all of the genes in the dataset, ranked according to decreasing membership.
The continuous clusters identified by fuzzy k-means clustering present a challenge in visualizing the clustering results. To this end, we have developed the program FuzzyExplorer, a PERL viewer based on the program GeneExplorer (developed by Christian Rees; C. Rees, P.O. Brown and D. Botstein, unpublished results). Using this software, the genes that belong significantly to each cluster can be identified and visualized by applying a membership cutoff: all genes whose membership is greater than the cutoff will be selected as part of the cluster and their gene-expression patterns will be displayed. Rather than define a single cutoff for each cluster, the visualization software applies a sliding membership cutoff to select the genes, allowing each cluster to be expanded or collapsed in terms of the number of genes selected. This flexibility allows the user to define the appropriate membership cutoff for each cluster. For example, at a very high membership cutoff, most of the genes in each cluster will have highly correlated expression patterns in all of the experiments and will be closely related in terms of function and regulation. As the membership cutoff decreases, additional genes will be assigned to each cluster group: in many cases, the similarity in the expression of the selected genes will exist over only a subset of the microarray experiments, promoting the identification of conditionally coregulated genes or genes whose products are more peripherally associated with the same cellular processes (see below). The appropriate membership cutoff will vary for each cluster and for the desired results, and selecting meaningful cutoffs can be guided by additional information (see Discussion). For many of the clusters discussed below, the membership cutoffs were empirically chosen to select genes that had coherent gene expression patterns over a given subset of the experiments.
Optimization of fuzzy clustering parameters
The parameters used for the fuzzy clustering were empirically defined for the analysis of yeast genomic expression data. The parameters were optimized to maximally recover clusters identified by hierarchical clustering: these were defined as all hierarchical gene clusters that had a Pearson correlation greater than 0.7 (see Materials and methods); in essence, these clusters served as positive controls. We also assessed the ability of the fuzzy k-means algorithm to identify groups of genes with coherent expression patterns, sets of genes whose products are functionally related, and clusters of known transcription factor targets. A summary of the parameter optimization is discussed below, with additional information available at .
We found that performing three clustering cycles, with the second and third cycles performed on subsets of the data as described in Materials and methods, maximized the recovery of the clusters identified through hierarchical clustering. Performing three cycles to identify k = 100 centroids recovered 79% of the known clusters in the dataset, compared to the case when the clustering was carried out in one round using identical parameters and seed vectors, for which 64% of the known clusters were identified (see ). Performing more than three rounds of clustering did not identify additional known clusters in the dataset, nor did it lead to the increased identification of large clusters of coherently expressed genes (data not shown). We therefore implemented three cycles of clustering, although more sophisticated methods of determining the optimal number of cycles can be envisioned.
A significant challenge in partitioning-clustering techniques is defining the number of clusters, k . With standard implementations of the k-means algorithm, underestimating k will result in large clusters of many genes that display divergent gene-expression patterns, while overestimating k will over-fit the data and split groups of similarly expressed genes into multiple, small clusters. Because of the dependence of k-means clustering on k, a number of methods have been developed to estimate this parameter [16,17]. In contrast, fuzzy k-means clustering appears to be less sensitive to over-fitting, because the genes are not forced to belong to only a single cluster. For example, performing the clustering with k = 300 added only approximately 30 unique centroids relative to when the clustering was performed with k = 120, and otherwise-identical parameters (see ). The relatively small number of centroids added when k was increased to 300 was largely due to the fact that the program identified many more replicates of centroids, which were consequently removed from the final set. Of the approximately 30 added centroids, most appeared to represent local minima, as they were centroids that were poorly reproduced in bootstrapping experiments (see Materials and methods) and identified few genes that had coherent patterns of expression (data not shown). Nonetheless, the addition of these patterns did not significantly affect the relative memberships of genes to the other centroids (data not shown), indicating that overestimating k did not appreciably affect the clustering results. This presents a significant advantage over standard k-means clustering as it reduces the requirement of accurately estimating k by allowing this parameter to be overestimated.
We examined a number of different initialization methods (data not shown) and found that seeding prototype centroids with the eigen vectors identified by PCA performed optimally. Together, the eigen vectors describe the variation in the gene-expression dataset, and therefore seeding the centroids with these vectors provides a systematic method of sampling the data space. In addition, this protocol produces deterministic clustering results, in contrast to the random initialization method commonly implemented in k-means clustering. One potential drawback of this method is that the number of clusters, k, is limited to the number of eigen vectors (which is determined by the number of microarray experiments analyzed). This limitation is alleviated by performing successive cycles of fuzzy k-means clustering on subsets of the data and recalculating the eigen vectors for the respective dataset used in each cycle. In addition, the clustering protocol can incorporate user-defined vectors to seed any number of additional centroids.
Many of the eigen vectors identified by PCA seemed to contain little information about the dataset, as previously noted for this type of analysis [18,19]. Nonetheless, most of the eigen vectors diverged to different gene-expression patterns within 10-15 iterations. The final centroids identified by the fuzzy clustering method showed little dependence on the eigen vectors used to seed the process, as evidenced by bootstrapping analysis. More than 50% of the final centroids were identified in 90% of the bootstrapping trials in which PCA was performed on a random sample of the data, despite the fact the most of the eigen vectors were significantly different in each trial (see Materials and methods). Most of the final centroids bore little similarity to the eigen vectors used to initialize the process, with less than 5% of final centroids similar to any of the eigen vectors with a Pearson correlation greater than 0.7.
Similarly to standard k-means clustering, the results of the fuzzy k-means method were affected by the data context. This was evident by the fact that the recovery of known clusters was enhanced by performing successive rounds of clustering on data subsets, as described above. In addition, the algorithm performed slightly better on an input dataset that consisted of the subset of yeast genes that showed differential expression patterns, as opposed to the entire gene-expression dataset. As the input dataset for the clustering process, we empirically selected genes whose standard deviation of expression was greater than around 1.4 (log20.45) from each gene's expression mean, amounting to approximately 4,400 out of the approximately 6,200 genes. The algorithm performed equally well on input datasets selected by other criteria of differential expression (data not shown). Performing the clustering on data subsets posed no limitation to the method, because at the end of the procedure all genes in the complete dataset were assigned membership values to the superset of identified centroids.
Fuzzy clustering of yeast genomic expression data
Fuzzy clustering identified previously unrecognized gene clusters
In addition to identifying new groups of similarly expressed genes, fuzzy k-means clustering also provided more comprehensive clusters of previously recognized groups of functionally related genes. In many cases, these genes were similarly expressed in only a subset of the experiments, a feature that prevented their association when the data were analyzed with the other clustering methods. An example of this is a centroid that represents genes that were strongly induced by amino-acid starvation (cluster 2 in Figure 5b). Essentially all of the top seven characterized genes associated with this cluster function in methionine biosynthesis and showed similar expression patterns in response to all of the experimental conditions (see ). However, as the membership cutoff was decreased to expand the cluster, additional functionally related genes were included, despite the fact that these genes were divergently expressed in response to conditions other than amino-acid limitation. Of the characterized genes within the top 100 genes belonging to this cluster, 64% (42/66) encode proteins that are directly involved in amino-acid biosynthesis, while more than half of the remaining characterized genes are involved in aspects of nitrogen and carbon metabolism that support amino-acid synthesis. Only half of these genes fell into the same cluster when the data were analyzed with hierarchical clustering or k-means clustering (see Materials and methods), while the remaining genes fell into multiple smaller groups in both cases.
Many genes were assigned to multiple clusters
One of the most significant advantages of fuzzy k-means clustering is that genes can belong to more than one group, revealing distinct aspects of their function and regulation. An illustration is provided by the gene KAR2, which encodes an HSP70 protein-folding chaperone localized to the endo-plasmic reticulum (ER) that is known to respond to defects in ER secretion and to unfolded proteins in this organelle (see  for review). Consistent with the known functions of the protein, KAR2 has significant membership in two clusters. The first (cluster 58 in Figure 5c) includes genes that were induced by the reducing agent DTT, a condition that prevents proper disulfide-bond formation and secretion in the ER . More than 75% (23/31) of the characterized genes within the top 50 genes in this cluster are localized to the ER and participate in various aspects of secretion, including protein folding (KAR2, LHS1, FKB2, JEM1), protein disulfide isomerization (EUG1, PDI1, ERO1), protein glycosylation (GFA1, PMT3, PMI40, SEC59, WBP1, OST2), and forward and retrograde trafficking (ERD2, ERP1, ERP2, SEC24, SEC13, RET2, RET3, and others ). Many of the uncharacterized genes in this group are likely to be functionally related to the characterized genes. In addition, KAR2 also has significant membership in a second cluster (cluster 11 in Figure 5d), which is composed of genes that were induced following heat shock and diamide treatment. Roughly 40% (14/36) of the top characterized genes associated with this group encode protein-folding chaperones localized to different subcellular regions (including those that encode the cytosolic Hsp90 and Hsp70 factors, the mitochondrial Hsp10/Hsp60p and Ssc1p, and the ER- and mitochondrial-associated Ssa1p), and their induction following heat shock and diamide treatment is likely to be in response to widespread protein unfolding inflicted by these conditions. That KAR2 clusters with both groups of genes reflects the dual role of Kar2p in the response to ER-specific challenges and to conditions that generally destabilize proteins throughout the cell, presumably without affecting other aspects of secretion.
The clustering of KAR2 with genes in these two clusters not only reflects the functional role of the encoded protein but also corroborates the conditional regulation of KAR2 expression. In response to defects in ER secretion, KAR2 is known to be induced by the transcription factor Hac1p as part of the unfolded protein response (UPR) [25,26,27,28]. In fact, nearly all of the top 50 genes in cluster 58 were shown by Travers et al.  to be induced following DTT treatment in a manner dependent on Hac1p and its upstream regulator, Ire1p [6,29,30]. However, unlike most of the genes in cluster 58, KAR2 is also induced in response to heat shock, along with the other chaperone genes in cluster 11, by the transcription factor Hsf1p . Consistently, most of the top genes in this group, including KAR2, contain multiple Hsf1p-binding sites in their promoters. The clustering of KAR2 with both clusters of genes therefore reflects the known induction of the gene by Hac1p as part of the UPR but by Hsf1p following heat shock.
Fuzzy assignment of genes to clusters
Number of genes assigned†
Number of genes assigned to >1 cluster‡
Percent assigned genes in >1 cluster§
Fuzzy clusters represent gene targets of yeast transcription factors
At a membership cutoff of 0.08, cluster 45 consisted of many genes involved in the tricarboxylic acid cycle and oxidative phosphorylation, and this group was enriched for the binding site of the Hap2/3/4p complex that is known to regulate the genes' expression. At a slightly lower cutoff of 0.06, additional genes involved in respiration and utilization of alternative carbon sources were assigned to the cluster, making the enrichment of the promoter element recognized by the catabolite repressor Mig1p statistically significant [40,41,42]. At both of these membership cutoffs, this cluster was also highly enriched for the sequence recognized by the stress-responsive factors Msn2p and Msn4p. These factors recognize a sequence that is very similar to the Mig1p-binding site, and it is possible that the enriched sequence actually represents derivative Mig1p elements. However, a specific role for Msn2p in the response to glucose starvation has recently been identified , raising the possibility that the factor is directly involved in regulating these genes. Another cluster (cluster 73) consists largely of genes that were sharply repressed in response to environmental stresses [2,44]. The ribosomal protein genes in this group are regulated by the factor Rap1p and contain multiple copies of its binding site within their promoters [45,46], while other genes in this group contain two putative regulatory sequences that have been previously identified as enriched in the promoters of many of these genes [2,47,48,49,50]. In each of these cases, the comprehensive clusters identified by the fuzzy k-means analysis included additional genes that were not previously known to be targets of these factors. Some of the binding-site occurrences shown in Figure 6 are likely to have occurred by chance (especially for sequences that are common in the genome, such as the Hap2/3/4p-, Mig1p-, and Msn2/Msn4p-binding sites). However, the similarity between the expression patterns of these genes and those of the known transcription factor targets, along with the functional correlation of the gene products and the presence of the respective binding sites in the genes' promoters, strongly suggest that many of these genes are legitimate targets of these regulators.
The overlapping clusters identified by fuzzy k-means clustering presented more complete groups of transcription factor targets compared to other clustering methods, enhancing the identification of promoter elements enriched in the clusters and implicating details of the conditional regulation of gene expression. An example can be seen in a set of around 15 genes that are induced by Yap1p in response to oxidative stress, but by the general stress factors Msn2p and/or Msn4p (Msn2/Msn4p) in response to other stressful conditions . These genes belonged significantly to three different clusters (Figure 6). One cluster (cluster 7) consisted of around 90 known Msn2/Msn4p targets and was enriched for genes whose promoters contain the known Msn2/Msn4p-binding site as well as other C-rich sequences that are similar to, but distinct from, the Msn2/Msn4p site (see below). The second cluster that these genes belonged to (cluster 4) comprised known Yap1p targets, most of which contain the Yap1p-binding site within their promoters. The third cluster (cluster 8) specifically represented the subgroup of genes that are conditionally regulated by Yap1p or Msn2/Msn4p, and this group was enriched for both the Yap1p element and the Msn2/Msn4p-binding site (but not other C-rich sequences). At a lower membership cutoff, additional genes were assigned to cluster 8 that showed similar expression patterns and contain both the Yap1p and Msn2/Msn4p promoter elements, suggesting that these genes may also be conditionally regulated by the factors. In contrast to these results, when the data were analyzed by k-means clustering, these genes could only be assigned to a cluster of Yap1p targets or a cluster of Msn2/Msn4p targets, and therefore no group of genes that was statistically enriched for both of these promoter elements could be identified (data not shown).
The majority of the clusters identified by fuzzy k-means clustering were not statistically enriched for known transcription factor binding sites. To identify novel enriched promoter sequences, we calculated the hypergeometric distribution of all possible 6-mer sequences in the promoters of the genes clustered by fuzzy k-means clustering. Almost all the statistically significant 6-mers represented known transcription factor binding sites, with the exception of a group of C-rich sequences with high statistical enrichment in the promoters of the Msn2/Msn4p targets in cluster 7 (Figure 6). We therefore focused our attention on the newly identified group of cell-wall genes, defined as the top 20 genes in cluster 61. Although none of the 6-mers met the significance cutoff for this cluster (P = 10-6), the most significant sequence (CGCGAA, P = 10-5) was identical to the core binding site of SBF, a transcription factor complex that regulates cell-cycle-dependent gene expression at the G1 to S transition [51,52,53]. In fact, more than two-thirds of these genes were identified by Iyer et al.  as part of a larger set of around 180 genes whose flanking regions were physically bound by the SBF complex. However, this set of genes was not coordinately expressed during cell-cycle progression [54,55], suggesting that these cell-wall genes may be regulated by a distinct mechanism in response to environmental conditions.
Fuzzy clustering uncovered correlations between the environmental conditions
However, when the microarray experiments were clustered on the basis of subsets of genes identified by fuzzy k-means clustering, more detailed correlations emerged, indicating more information about the effects of each environment. An example is the sulfhydryl-oxidizing drug diamide, which affects many aspects of cell biochemistry. When the experiment clustering was performed on the basis of genes encoding protein-folding chaperones (the top 10 genes in cluster 11), a striking similarity between the effects of diamide and heat shock was observed (Figure 8b). In contrast, when the microarray clustering was performed on the basis of genes involved in oxidative stress defense, (the top 24 genes belonging to cluster 4), diamide was most similar to hydrogen peroxide and menadione, which inflict oxidative damage by generating reactive oxygen species (Figure 8c). In terms of the genes induced in the UPR (identified as the top 30 genes or so in cluster 58), the effects of diamide were most similar to those triggered by the reducing agent DTT (Figure 8d). That the effects of diamide were similar to those of different environmental conditions depending on the genes analyzed reflects the diverse effects of this drug on the cell (Figure 8f). By crosslinking protein sulfhydryl groups, diamide is thought to disrupt protein structure and trigger oxidative stress [57,58], both of which are likely to perturb normal ER functions .
The similarities between other environmental conditions are less well understood. For example, limitation of the essential nutrient zinc triggered diverse gene-expression changes in the cell . Although the overall genomic expression program triggered by this condition was distinct, when the experiment clustering was based on genes involved in the use of alternative carbon sources and respiration (cluster 36 and cluster 45, respectively), a significant similarity between the effects of zinc limitation and conditions that involve glucose starvation emerged (Figure 8e). Like carbon starvation, zinc limitation triggered the increased expression of these genes, even though the cells were not limited for glucose (T. Lyons, personal communication). This result suggests that zinc-limited cells may have a defect in glucose metabolism, leading to the induced expression of the respiration and carbon-utilization genes. While a link between zinc limitation and sugar metabolism has been established in mammals , the molecular basis of this correlation is not known. In contrast to this relationship, when the experiment clustering was performed on genes encoding ER-resident proteins, the effects of zinc starvation were most similar to those inflicted by DTT and diamide (Figure 8d), suggesting that zinc limitation may initiate the UPR. Zinc starvation is not known to induce this program in yeast (C. Patil, personal communication). However, a potential connection between zinc and the UPR is the protein calreticulin, an ER protein-folding chaperone that acts on glycosylated proteins . That mammalian calreticulin is a zinc-dependent protein  raises the possibility that zinc limitation prevents the proper activity of this protein in yeast, leading to unfolded ER proteins and triggering the subtle increase in expression of genes that participate in the UPR. The correlations between the gene-expression changes triggered by these conditions suggest hypotheses about the effects of each condition that warrant future experiments.
To respond to diverse and frequently changing environmental conditions, yeast cells must precisely mediate the synthesis and function of the proteins in the cell. This is controlled in part by the overall genomic expression program that results from the combined action of different regulatory factors, each of which responds to specific extra- and intra-cellular signals. Many of these regulators act under specific conditions, and together they govern the expression of overlapping sets of genes. Individual genes, in turn, are regulated by multiple, condition-specific systems that result in each gene being coexpressed with different groups of genes under different situations.
Although examples of this type of regulation have been observed on an individual gene basis, our results suggest that the condition-specific regulation of overlapping sets of yeast genes is a prevalent theme in the regulation of yeast gene expression. A large fraction of yeast genes is expressed in patterns that are similar to different groups of genes in response to different subsets of the experiments (Table 1). Furthermore, a substantial number of these genes contain multiple transcription factor binding sites in their promoters (Figure 6, and see ), consistent with the idea that they are conditionally regulated by multiple, independent regulatory systems. The condition-specific regulation of gene expression has also been implicated in higher organisms [63,64] and probably has a significant role in regulating genomic expression. This is in contrast to the regulatory logic of prokaryotes, in which the expression of defined sets of genes in operons is a predominant feature of regulation. Thus, the conditional regulation of overlapping groups of genes may represent a regulatory theme that is particularly important in eukaryotes.
The prevalence of conditional gene coexpression poses a challenge for the analysis of gene-expression data, because many genes will have expression patterns that are similar to multiple, distinct gene groups. Fuzzy k-means clustering is well suited to identifying conditionally coexpressed genes for a number of reasons. First and foremost, the method can present overlapping clusters, revealing distinct features of each gene's function and regulation. The resulting implications can be used to assign refined hypothetical functions to uncharacterized gene products on the basis of the known functions encoded by the genes in each cluster. In addition, this information can suggest additional cellular roles of well studied proteins (see ). The overlapping clusters identified by fuzzy k-means clustering also present more comprehensive groups of conditionally coregulated genes. This is especially important for the successful identification of regulatory motifs common to the promoters of similarly expressed genes, because motif-finding algorithms are often hindered by small sample sets. More than two-thirds of the gene clusters we identified are not enriched for known regulatory elements, highlighting the potential for discovering novel sequences involved in gene-expression regulation. We expect that fuzzy k-means clustering will advance that discovery, as illustrated by our ability to identify new sequences conserved in the promoters of clustered genes.
Another benefit of the fuzzy k-means algorithm is that it identifies continuous clusters of genes. This allows each cluster to be expanded or collapsed to view genes of varying similarity in expression. While the genes of highest membership in a given cluster are often tightly correlated in terms of biochemical function and regulation, expanding the cluster can identify genes that are similarly expressed in only subsets of the experimental conditions. The resulting gene relationships can suggest details about the cellular roles served by the encoded gene products and the regulatory systems that govern the genes' expression in response to the relevant conditions. Thus, the results of fuzzy k-means clustering are naturally suited for biologists to use in an intuitive and physiologically meaningful way.
The fuzzy k-means algorithm used here was chosen for its conceptual and algorithmic simplicity. There are many alternative algorithms that might accomplish the same ends. For example, Ihmels et al.  have applied a heuristic algorithm to the analysis of yeast gene-expression data to identify overlapping sets of genes whose expression is similar to known gene-expression patterns. This method produced interesting results and identified genes that were similarly expressed to known transcription factor targets. A key difference between these algorithms is that fuzzy k-means clustering requires no a priori information about the dataset. Thus, each method may be suitable for a different biological question, namely identifying genes whose expression is similar to known or expected gene expression patterns versus an unbiased, de novo exploration of the gene-expression dataset.
Despite the advantages of fuzzy k-means clustering discussed above, the method also has a number of limitations. Most notably, the assignment of genes to the clusters requires a user-defined membership cutoff. While this allows complete flexibility in data exploration, selecting meaningful cutoffs is a challenge. Choice of cutoff can be guided by a number of criteria, including the coherence of the selected gene-expression patterns, the functional relationships of the characterized genes selected, or the statistical enrichment of sequences in the selected genes' promoters. We have attempted to alleviate the challenge of selecting cutoffs by providing visualization software specifically designed for the fuzzy clustering results, allowing the gene expression data to be inspected directly and dynamically.
Although the fuzzy k-means clustering method successfully identified nearly 90% of the known clusters in the dataset, it routinely failed to identify a small number of groups that were identified by hierarchical clustering. The inability of the method to find the expression patterns representing these groups seemed to be dependent on the overall properties of the dataset, rather than the absence of an appropriate eigen vector used to initiate the process, as the program was unable to identify these patterns even when the process was initiated by seeding the centroids with the unidentified patterns (data not shown). We have accounted for this limitation by allowing any number of expression patterns to be added to the final list of identified cluster centroids, thereby revealing genes that are similarly expressed to the pattern in question.
Despite these limitations, the unique advantages of fuzzy k-means clustering make the technique a valuable tool for gene-expression analysis. We believe that fuzzy k-means clustering will be a useful complement to other computational methods commonly used to analyze gene-expression data. Whereas algorithms that present discrete gene clusters provide a straightforward method of initial data exploration, the flexibility of fuzzy k-means clustering can be used to reveal more complex correlations between gene-expression patterns, promoting refined hypotheses of the role and regulation of gene-expression changes.
Materials and methods
Software and supplementary information
The clustering software FuzzyK and the visualization program FuzzyExplorer are available from , along with the complete clustering results and additional information.
Published genomic expression data of wild-type S. cerevisiae responding to zinc starvation , phosphate limitation , DNA-damaging agents , and a variety of stressful environmental changes  were combined into a dataset of 6,153 genes and 93 microarray experiments (dataset A). These data were chosen because the experiments were performed using the same experimental and microarray methods . The data were downloaded from the Stanford Microarray Database and were otherwise unprocessed before clustering, with the exception of the heat shock, DTT, and carbon-source experiments, which were transformed as previously described . The complete dataset organized by hierarchical clustering can be downloaded from . A subset of this data was used in the fuzzy k-means clustering and consisted of 4,373 genes whose standard deviation in expression was log2(0.45) from each vector mean (dataset B), identified using the program Cluster .
For all methods discussed, the weighted, uncentered Pearson correlation was used as the similarity metric (referred to simply as the Pearson correlation) . Where noted, the Pearson distance was used, equal to 1 - correlation. The array weights used in the calculation were generated as previously described, using a Pearson correlation cutoff of 0.8 and an exponent of 1 (see  for details).
Average linkage hierarchical clustering of the data was carried out using the program Cluster as previously described, using the weighted, uncentered Pearson correlation as the similarity metric . Dataset A and dataset B were hierarchically clustered with identical parameters, using array weights calculated based on a correlation cutoff of 0.8 and an exponent of 1 . Clustering of the microarray experiments was carried out similarly, using gene weights calculated with a correlation cutoff of 0.7 and an exponent of 1.
To represent all the significant gene clusters identified by hierarchical clustering, the dendrogram generated for dataset B by the Cluster program was parsed, and the average expression patterns of clusters of more than three genes with an average Pearson correlation >0.7 were calculated. Through this method, 38 hierarchical cluster means were identified. The parsing process was repeated to calculate the average expression patterns of clusters of more than three genes with average correlations >0.8 and >0.9. Cluster means not already represented in the initial group of 38 clusters were added to the group, resulting in a total of 53 hierarchical cluster means identified for the dataset. The centroids identified through fuzzy k-means clustering were considered similar to the hierarchical cluster means if the Pearson correlation between the vectors was >0.7.
Fuzzy k-means clustering
We implemented the modified fuzzy k-means method in the C++ program FuzzyK, available at .
The fuzzy k-means algorithm  is based on the minimization of the objective function shown below, for a given fuzzy partition of the data, F, and a set of K cluster centroids, V
where X i is the expression pattern of the ith gene in the dataset, V j is the centroid of the jth cluster, d is the Pearson distance between X i and V j , m XiVj is the membership of X i in cluster V j , N is the number of genes in the dataset, and K is the total number of clusters.
We implemented the algorithm to perform three successive cycles of fuzzy k-means clustering. The first cycle of clustering was initialized by performing PCA on dataset B using the GNU Scientific Library SVD function. Of the top k/3 eigen vectors, those to which no gene had a maximal Pearson correlation were eliminated (for k = 120, only one eigen vector was eliminated in each cycle). The remaining eigen vectors were used as prototype centroids for that clustering cycle. Subsequent cycles of clustering were initialized similarly, except that PCA was performed on the respective data subset used in that clustering cycle.
During the centroid refinement in each clustering cycle, new centroids were calculated on the basis of the weighted mean of all the gene-expression patterns in the dataset according to
where each gene's membership m (a continuous variable from 0 to 1) was defined as
and w was the gene weight: in the first clustering cycle, the gene weights used were those defined by the program Cluster, using a Pearson correlation cutoff of 0.7 and an exponent of 1  and in subsequent cycles the gene weight was empirically defined as
where d Xi, Vj is the Pearson distance between gene X i , and vector V j , c Xi, Xn is the Pearson correlation between genes X i and X n , and x is the correlation cutoff, in this case 0.6. This weighting scheme served to overweight genes that were correlated to other genes in the dataset.
In each clustering cycle, the centroids were iteratively refined until the average change in gene memberships between iterations was <0.001 (approximately 40-60 total iterations in each clustering cycle). While around 85% of the centroids stabilized within approximately 15 iterations, some of the centroids required more than 40 iterations before stabilizing.
After each clustering cycle, the centroids were combined with those identified in previous cycles, and replicate centroids were averaged: each centroid was compared to all other centroids in the set, and centroid pairs correlated >0.9 were replaced by the average of the two vectors. The new vector was compared to the remaining centroids in the set and averaged with those to which it was correlated >0.9. This process continued until each centroid (or the vector that replaced it) was compared to all other existing centroids in the set.
Following the first and second clustering cycles, data subsets were selected to apply to subsequent rounds of clustering. Genes that were correlated to any existing centroid with a Pearson correlation >0.7 were removed from the dataset, and array and gene weights were recalculated on the data subset as described above. The new data subset was applied to a subsequent cycle of clustering, performed as described above.
Final gene-cluster assignments
The centroids identified through three rounds of fuzzy clustering were combined into one set and replicate centroids were averaged, as described above. Each gene in dataset A was assigned a membership score to each of the unique centroids. For display in the figures, the final list of centroids was ordered by hierarchical clustering. Genes were selected in each cluster if their membership score was greater than the empirically determined membership cutoff applied to each cluster. For display in Figure 5, the genes selected in each cluster were subsequently organized by hierarchical clustering.
For an optimal comparison of the results of k-means and fuzzy k-means clustering, we performed the k-means clustering identically to the fuzzy k-means protocol, except that during the clustering iterations each gene contributed only to the cluster to which it was most similar (with a membership of 1.0). Three rounds of hard k-means clustering were performed with k = 120, and each cycle was initiated by seeding k/3 centroids with the most informative k/3 eigen vectors identified by PCA, as described above. The process for merging centroids, selecting the data subsets for subsequent clustering rounds, and gene and array weighting were carried out identically as described for fuzzy k-means clustering. After identification of the final set of centroids, each gene was assigned only to the centroid to which it was most similar.
To estimate the dependence of the procedure on the initial dataset, a bootstrapping method was applied in which the fuzzy k-means protocol was repeated 100 times, each time on 4,373 genes chosen randomly from dataset B, with k = 102. The occurrence of each centroid in the bootstrap trials was determined by summing the number of trials that contained a centroid that was correlated >0.7 to the centroid in question. By this criterion, roughly 50% of the centroids were identified in 90% of the trials, while approximately 25% of the centroids were identified in all of the trials.
To estimate the dependence of the procedure on the eigen vectors used to seed the clusters, a similar bootstrapping procedure was carried out in which PCA was performed on 4,373 genes chosen randomly from dataset B but the cluster refinement was done using all genes in dataset B. The frequency of each centroid was scored as described above. The results were similar to the previous bootstrapping experiment, with around 50% of the centroids present in 90% of the bootstrapping trials, and around 25% of the centroids identified in all the trials.
Genes were assigned to all the 91 identified centroids on the basis of a membership cutoff of 0.06 or 0.08. The statistical enrichment of each cluster for genes that contained known transcription factor binding sites or different 6-mer sequences within 800 bp upstream of the ORF was assessed, according to the hypergeometric distribution. The probability of observing at least q genes that contained one or more copies of a given sequence out of l genes in a fuzzy cluster was calculated as
where M is the number of genes in the genome that contain the motif and N is the total number of genes in the genome. Forty-three transcription factor binding sites were compiled from the literature (see  for a complete list of sequences). The enrichment of each sequence was considered significant if the P value was <0.01 divided by the number of elements searched, or 2 × 10-4 for the 43 transcription factor binding sites and 2 × 10-6 for the 4,096 different 6-mers.
The program MEME  was seeded with the most significant 6-mer (CGCGAA) enriched in the promoters of the genes selected for cluster 61, and the program was run with a variety of parameters (the parameters and MEME output can be found at ). Genes whose promoters contained significant matches to the identified matrices were identified using Patser on the RSA tools website [68,69].
We thank A. Moses and E. Kelley for helpful suggestions and programming assistance, A. Alizadeh, J. Bolderick, N. Ogawa, C. Patil, C. Rees, P. Spellman, and M. Kamvysselis for helpful discussions, and A. Moses, D. Chiang, and J. Fay for critical reading of the manuscript. A.P.G. is supported by an NSF postdoctoral fellowship in biological informatics, and M.B.E. is a Pew Scholar in the Biomedical Sciences. This work was conducted under the US Department of Energy contract No. ED-AC03-76SF00098.
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.PubMedPubMed CentralView ArticleGoogle Scholar
- Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000, 11: 4241-4257.PubMedPubMed CentralView ArticleGoogle Scholar
- Yun CW, Ferea T, Rashford J, Ardon O, Brown PO, Botstein D, Kaplan J, Philpott CC: Desferrioxamine-mediated iron uptake in Saccharomyces cerevisiae. Evidence for two pathways of iron uptake. J Biol Chem. 2000, 275: 10709-10715. 10.1074/jbc.275.14.10709.PubMedView ArticleGoogle Scholar
- Lyons TJ, Gasch AP, Gaither LA, Botstein D, Brown PO, Eide DJ: Genome-wide characterization of the Zap1p zinc-responsive regulon in yeast. Proc Natl Acad Sci USA. 2000, 97: 7957-7962. 10.1073/pnas.97.14.7957.PubMedPubMed CentralView ArticleGoogle Scholar
- Ogawa N, DeRisi J, Brown PO: New components of a system for phosphate accumulation and polyphosphate metabolism in Saccharomyces cerevisiae revealed by genomic expression analysis. Mol Biol Cell. 2000, 11: 4309-4321.PubMedPubMed CentralView ArticleGoogle Scholar
- Travers KJ, Patil CK, Wodicka L, Lockhart DJ, Weissman JS, Walter P: Functional and genomic analyses reveal an essential coordination between the unfolded protein response and ER-associated degradation. Cell. 2000, 101: 249-258.PubMedView ArticleGoogle Scholar
- MacQueen J: Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics and Probability. Edited by: Le Cam L, Neyman J. 1967, Berkeley: University of California, Berkeley Press, 281-297.Google Scholar
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet. 1999, 22: 281-285. 10.1038/10343.PubMedView ArticleGoogle Scholar
- Fuzzy models for Pattern Recognition. Edited by: Bezdek JC, Pal SK. 1992, New York: IEEE Press
- Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. J Comput Biol. 2000, 7: 601-620. 10.1089/106652700750050961.PubMedView ArticleGoogle Scholar
- Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N: Revealing modular organization in the yeast transcriptional network. Nat Genet. 2002, 31: 370-377.PubMedGoogle Scholar
- Bezdek JC: Fuzzy Mathematics in Pattern Classification. 1973, Ithaca, NY: Cornell UniversityGoogle Scholar
- Gath I, Geva AB: Unsupervised optimal fuzzy clustering. Trans Pattern Analysis Machine Intell. 1989, 11: 773-781. 10.1109/34.192473.View ArticleGoogle Scholar
- FuzzyK. [http://rana.lbl.gov/FuzzyK]
- Aldenderfer MS, Blashfield RK: Cluster analysis. In Quantitative Applications in the Social Sciences. Edited by: Lewis-Beck MS. 1984, Newbury Park: Sage, 88:Google Scholar
- Fraley C, Raftery AE: How many clusters? Which clustering method? Answers via model-based cluster analysis. Techn Rep. 1998, 329: 1-19.Google Scholar
- Dudoit S, Fridlyand J: A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol. 2002, 3: research0036.1-0036.21. 10.1186/gb-2002-3-7-research0036.View ArticleGoogle Scholar
- Raychaudhuri S, Stuart JM, Altman RB: Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput. 2000, 455-466.Google Scholar
- Alter O, Brown PO, Botstein D: Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA. 2000, 97: 10101-10106. 10.1073/pnas.97.18.10101.PubMedPubMed CentralView ArticleGoogle Scholar
- Gasch AP, Huang M, Metzner S, Botstein D, Elledge SJ, Brown PO: Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol Biol Cell. 2001, 12: 2987-3003.PubMedPubMed CentralView ArticleGoogle Scholar
- Saccharomyces Genome Database. [http://genome-www.stanford.edu/Saccharomyces/]
- Nielsen H, Engelbrecht J, Brunak S, von Heijne G: A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int J Neural Syst. 1997, 8: 581-599. 10.1142/S0129065797000537.PubMedView ArticleGoogle Scholar
- Gething MJ: Role and regulation of the ER chaperone BiP. Semin Cell Dev Biol. 1999, 10: 465-472. 10.1006/scdb.1999.0318.PubMedView ArticleGoogle Scholar
- Jamsa E, Simonen M, Makarow M: Selective retention of secretory proteins in the yeast endoplasmic reticulum by treatment of cells with a reducing agent. Yeast. 1994, 10: 355-370.PubMedView ArticleGoogle Scholar
- Cox JS, Walter P: A novel mechanism for regulating activity of a transcription factor that controls the unfolded protein response. Cell. 1996, 87: 391-404.PubMedView ArticleGoogle Scholar
- Mori K, Kawahara T, Yoshida H, Yanagi H, Yura T: Signalling from endoplasmic reticulum to nucleus: transcription factor with a basic-leucine zipper motif is required for the unfolded protein - response pathway. Genes Cells. 1996, 1: 803-817. 10.1046/j.1365-2443.1996.d01-274.x.PubMedView ArticleGoogle Scholar
- Nikawa J, Akiyoshi M, Hirata S, Fukuda T: Saccharomyces cerevisiae IRE2/HAC1 is involved in IRE1-mediated KAR2 expression. Nucleic Acids Res. 1996, 24: 4222-4226. 10.1093/nar/24.21.4222.PubMedPubMed CentralView ArticleGoogle Scholar
- Chapman R, Sidrauski C, Walter P: Intracellular signaling from the endoplasmic reticulum to the nucleus. Annu Rev Cell Dev Biol. 1998, 14: 459-485. 10.1146/annurev.cellbio.14.1.459.PubMedView ArticleGoogle Scholar
- Cox JS, Shamu CE, Walter P: Transcriptional induction of genes encoding endoplasmic reticulum resident proteins requires a transmembrane protein kinase. Cell. 1993, 73: 1197-1206.PubMedView ArticleGoogle Scholar
- Mori K, Ma W, Gething MJ, Sambrook J: A transmembrane protein with a cdc2+/CDC28-related kinase activity is required for signaling from the ER to the nucleus. Cell. 1993, 74: 743-756.PubMedView ArticleGoogle Scholar
- Oka M, Kimata Y, Mori K, Kohno K: Saccharomyces cerevisiae KAR2 (BiP) gene expression is induced by loss of cytosolic HSP70/Ssa1p through a heat shock element-mediated pathway. J Biochem (Tokyo). 1997, 121: 578-584.View ArticleGoogle Scholar
- Arndt KT, Styles C, Fink GR: Multiple global regulators control HIS4 transcription in yeast. Science. 1987, 237: 874-880.PubMedView ArticleGoogle Scholar
- Daignan-Fornier B, Fink GR: Coregulation of purine and histidine biosynthesis by the transcriptional activators BAS1 and BAS2. Proc Natl Acad Sci USA. 1992, 89: 6746-6750.PubMedPubMed CentralView ArticleGoogle Scholar
- Jia MH, Larossa RA, Lee JM, Rafalski A, Derose E, Gonye G, Xue Z: Global expression profiling of yeast treated with an inhibitor of amino acid biosynthesis, sulfometuron methyl. Physiol Genomics. 2000, 3: 83-92.PubMedGoogle Scholar
- Natarajan K, Meyer MR, Jackson BM, Slade D, Roberts C, Hinnebusch AG, Marton MJ: Transcriptional profiling shows that Gcn4p is a master regulator of gene expression during amino acid starvation in yeast. Mol Cell Biol. 2001, 21: 4347-4368. 10.1128/MCB.21.13.4347-4368.2001.PubMedPubMed CentralView ArticleGoogle Scholar
- Cai M, Davis RW: Yeast centromere binding protein CBF1, of the helix-loop-helix protein family, is required for chromosome stability and methionine prototrophy. Cell. 1990, 61: 437-446.PubMedView ArticleGoogle Scholar
- Thomas D, Jacquemin I, Surdin-Kerjan Y: MET4, a leucine zipper protein, and centromere-binding factor 1 are both required for transcriptional activation of sulfur metabolism in Saccharomyces cerevisiae. Mol Cell Biol. 1992, 12: 1719-1727.PubMedPubMed CentralView ArticleGoogle Scholar
- Blaiseau PL, Isnard AD, Surdin-Kerjan Y, Thomas D: Met31p and Met32p, two related zinc finger proteins, are involved in transcriptional regulation of yeast sulfur amino acid metabolism. Mol Cell Biol. 1997, 17: 3640-3648.PubMedPubMed CentralView ArticleGoogle Scholar
- Blaiseau PL, Thomas D: Multiple transcriptional activation complexes tether the yeast activator Met4 to DNA. EMBO J. 1998, 17: 6327-6336. 10.1093/emboj/17.21.6327.PubMedPubMed CentralView ArticleGoogle Scholar
- Nehlin JO, Ronne H: Yeast MIG1 repressor is related to the mammalian early growth response and Wilms' tumour finger proteins. EMBO J. 1990, 9: 2891-2898.PubMedPubMed CentralGoogle Scholar
- de Winde JH, Grivell LA: Global regulation of mitochondrial biogenesis in Saccharomyces cerevisiae. Prog Nucleic Acid Res Mol Biol. 1993, 46: 51-91.PubMedView ArticleGoogle Scholar
- Gancedo JM: Yeast carbon catabolite repression. Microbiol Mol Biol Rev. 1998, 62: 334-361.PubMedPubMed CentralGoogle Scholar
- Gorner W, Durchschlag E, Wolf J, Brown EL, Ammerer G, Ruis H, Schuller C: Acute glucose starvation activates the nuclear localization signal of a stress-specific yeast transcription factor. EMBO J. 2002, 21: 135-144. 10.1093/emboj/21.1.135.PubMedPubMed CentralView ArticleGoogle Scholar
- Causton HC, Ren B, Koh SS, Harbison CT, Kanin E, Jennings EG, Lee TI, True HL, Lander ES, Young RA: Remodeling of yeast genome expression in response to environmental changes. Mol Biol Cell. 2001, 12: 323-337.PubMedPubMed CentralView ArticleGoogle Scholar
- Mager WH, Planta RJ: Multifunctional DNA-binding proteins mediate concerted transcription activation of yeast ribosomal protein genes. Biochim Biophys Acta. 1990, 1050: 351-355. 10.1016/0167-4781(90)90193-6.PubMedView ArticleGoogle Scholar
- Moehle CM, Hinnebusch AG: Association of RAP1 binding sites with stringent control of ribosomal protein gene transcription in Saccharomyces cerevisiae. Mol Cell Biol. 1991, 11: 2723-2735.PubMedPubMed CentralView ArticleGoogle Scholar
- Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000, 296: 1205-1214. 10.1006/jmbi.2000.3519.PubMedView ArticleGoogle Scholar
- Bussemaker HJ, Li H, Siggia ED: Regulatory element detection using correlation with expression. Nat Genet. 2001, 27: 167-171. 10.1038/84792.PubMedView ArticleGoogle Scholar
- Fazzio TG, Kooperberg C, Goldmark JP, Neal C, Basom R, Delrow J, Tsukiyama T: Widespread collaboration of Isw2 and Sin3-Rpd3 chromatin remodeling complexes in transcriptional repression. Mol Cell Biol. 2001, 21: 6450-6460. 10.1128/MCB.21.19.6450-6460.2001.PubMedPubMed CentralView ArticleGoogle Scholar
- Kurdistani SK, Robyr D, Tavazoie S, Grunstein M: Genome-wide binding map of the histone deacetylase Rpd3 in yeast. Nat Genet. 2002, 31: 248-254. 10.1038/ng907.PubMedView ArticleGoogle Scholar
- Andrews BJ, Herskowitz I: The yeast SWI4 protein contains a motif present in developmental regulators and is part of a complex involved in cell-cycle-dependent transcription. Nature. 1989, 342: 830-833. 10.1038/342830a0.PubMedView ArticleGoogle Scholar
- Andrews BJ, Herskowitz I: Identification of a DNA binding factor involved in cell-cycle control of the yeast HO gene. Cell. 1989, 57: 21-29.PubMedView ArticleGoogle Scholar
- Koch C, Moll T, Neuberg M, Ahorn H, Nasmyth K: A role for the transcription factors Mbp1 and Swi4 in progression from G1 to S phase. Science. 1993, 261: 1551-1557.PubMedView ArticleGoogle Scholar
- Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M, Brown PO: Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature. 2001, 409: 533-538. 10.1038/35054095.PubMedView ArticleGoogle Scholar
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9: 3273-3297.PubMedPubMed CentralView ArticleGoogle Scholar
- Bailey TB, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. 1994, Menlo Park, CA: AAAI PressGoogle Scholar
- Kosower NS, Kosower EM: Diamide: an oxidant probe for thiols. Methods Enzymol. 1995, 251: 123-133.PubMedView ArticleGoogle Scholar
- Shenton D, Perrone G, Quinn KA, Dawes IW, Grant CM: Regulation of protein S-thiolation by glutaredoxin 5 in the yeast Saccharomyces cerevisiae. J Biol Chem. 2002, 277: 16853-16859. 10.1074/jbc.M200559200.PubMedView ArticleGoogle Scholar
- Tortorella D, Story CM, Huppa JB, Wiertz EJ, Jones TR, Bacik I, Bennink JR, Yewdell JW, Ploegh HL: Dislocation of type I membrane proteins from the ER to the cytosol is sensitive to changes in redox potential. J Cell Biol. 1998, 142: 365-376. 10.1083/jcb.142.2.365.PubMedPubMed CentralView ArticleGoogle Scholar
- Salgueiro MJ, Krebs N, Zubillaga MB, Weill R, Postaire E, Lysionek AE, Caro RA, De Paoli T, Hager A, Boccio J: Zinc and diabetes mellitus: is there a need of zinc supplementation in diabetes mellitus patients?. Biol Trace Elem Res. 2001, 81: 215-228. 10.1385/BTER:81:3:215.PubMedView ArticleGoogle Scholar
- Hebert DN, Simons JF, Peterson JR, Helenius A: Calnexin, calreticulin, and Bip/Kar2p in protein folding. Cold Spring Harb Symp Quant Biol. 1995, 60: 405-415.PubMedView ArticleGoogle Scholar
- Baksh S, Spamer C, Heilmann C, Michalak M: Identification of the Zn2+ binding region in calreticulin. FEBS Lett. 1995, 376: 53-57. 10.1016/0014-5793(95)01246-4.PubMedView ArticleGoogle Scholar
- Cox LS: Multiple pathways control cell growth and transformation: overlapping and independent activities of p53 and p21Cip1/WAF1/Sdi1. J Pathol. 1997, 183: 134-140. 10.1002/(SICI)1096-9896(199710)183:2<134::AID-PATH960>3.0.CO;2-D.PubMedView ArticleGoogle Scholar
- Pritts T, Hungness E, Wang Q, Robb B, Hershko D, Hasselgren PO: Mucosal and enterocyte IL-6 production during sepsis and endotoxemia - role of transcription factors and regulation by the stress response. Am J Surg. 2002, 183: 372-383. 10.1016/S0002-9610(02)00812-7.PubMedView ArticleGoogle Scholar
- Thomas D, Surdin-Kerjan Y: Metabolism of sulfur amino acids in Saccharomyces cerevisiae. Microbiol Mol Biol Rev. 1997, 61: 503-532.PubMedPubMed CentralGoogle Scholar
- Eisen lab. [http://rana.lbl.gov]
- Cluster and TreeView Manual. [http://rana.lbl.gov/manuals/ClusterTreeView.pdf]
- Regulatory Sequence Analysis Tools. [http://rsat.ulb.ac.be/rsat/]
- van Helden J, Andre B, Collado-Vides J: A web site for the computational analysis of yeast regulatory sequences. Yeast. 2000, 16: 177-187. 10.1002/(SICI)1097-0061(20000130)16:2<177::AID-YEA516>3.0.CO;2-9.PubMedView ArticleGoogle Scholar