A prediction-based resampling method for estimating the number of clusters in a dataset

Dudoit, Sandrine; Fridlyand, Jane

doi:10.1186/gb-2002-3-7-research0036

Research
Published: 25 June 2002

A prediction-based resampling method for estimating the number of clusters in a dataset

Sandrine Dudoit^1,3 &
Jane Fridlyand^2,3

Genome Biology volume 3, Article number: research0036.1 (2002) Cite this article

27k Accesses
400 Citations
6 Altmetric
Metrics details

Abstract

Background

Microarray technology is increasingly being applied in biological and medical research to address a wide range of problems, such as the classification of tumors. An important statistical problem associated with tumor classification is the identification of new tumor classes using gene-expression profiles. Two essential aspects of this clustering problem are: to estimate the number of clusters, if any, in a dataset; and to allocate tumor samples to these clusters, and assess the confidence of cluster assignments for individual samples. Here we address the first of these problems.

Results

We have developed a new prediction-based resampling method, Clest, to estimate the number of clusters in a dataset. The performance of the new and existing methods were compared using simulated data and gene-expression data from four recently published cancer microarray studies. Clest was generally found to be more accurate and robust than the six existing methods considered in the study.

Conclusions

Focusing on prediction accuracy in conjunction with resampling produces accurate and robust estimates of the number of clusters.

Background

The burgeoning field of genomics, and in particular DNA microarray experiments, has revived interest in cluster analysis by raising new methodological and computational challenges. DNA microarrays are part of a new and promising class of biotechnologies that allow the monitoring of expression levels in cells for thousands of genes simultaneously. Microarray experiments are increasingly being caried out in biological and medical research to address a wide range of problems, including the classification of tumors [1,2,3,4,5,6]. A reliable and precise classification of tumors is essential for successful diagnosis and treatment of cancer. By allowing the monitoring of expression levels on a genomic scale, microarray experiments may lead to a more complete understanding of the molecular variations among tumors and hence to a finer and more reliable classification. An important statistical problem associated with tumor classification is the identification of new tumor classes using gene-expression profiles. Two essential aspects of this clustering problem are: first, to accurately estimate the number of clusters, if any, in a dataset; and second, to allocate tumor samples accurately to these clusters, and assess the confidence of cluster assignments for individual samples. In a clinical application of microarray-based cancer diagnosis, the definition of new tumor classes would be based on the clustering results, and these classes would then be used to build predictors for new tumor samples. Inaccurate cluster assignments could lead to erroneous diagnoses and unsuitable treatment protocols.

Here we address the estimation of the number of clusters in a dataset. First, we describe the basic principles of cluster analysis and review existing methods for estimating the number of clusters. We then present a new prediction-based resampling method, Clest, for estimating the number of clusters in a dataset. The performance of the new and existing methods is compared using simulated data and gene-expression data from four recently published cancer microarray studies. We have addressed the problem of improving and assessing the accuracy of a given clustering procedure in [7].

Cluster analysis

In classification, one is concerned with assigning objects to classes on the basis of measurements made on these objects. There are two main aspects to classification: discrimination and clustering, or supervised and unsupervised learning. In unsupervised learning (also known as cluster analysis, class discovery and unsupervised pattern recognition), the classes are unknown a priori and need to be discovered from the data. In contrast, in supervised learning (also known as discriminant analysis, class prediction, and supervised pattern recognition), the classes are predefined and the task is to understand the basis for the classification from a set of labeled objects (training or learning set). This information is then used to classify future observations. The present article focuses on the unsupervised problem, that is, on cluster analysis, but draws on notions from supervised learning to address the problem.

In cluster analysis, the data are assumed to be sampled from a mixture distribution with K components corresponding to the K clusters to be recovered. Let (X₁, ..., X_p) denote a random 1 × p vector of explanatory variables or features, and let Y {1, ..., K} denote the unknown component or cluster label. Given a sample of X values, the goal is to estimate the number of clusters K and to estimate, for each observation, its cluster label Y.

Suppose we have data X = (x_ij) on p explanatory variables (for example, genes) for n observations (for example, tumor mRNA samples), where x_ij denotes the realization of variable X_j for observation i and x_i = (x_i1,...,.x_ip) denotes the data vector for observation i, i = 1,...,n, j = 1,...,p. We consider clustering procedures that partition the learning set = {x₁,...,x_n} into K clusters of observations that are 'similar' to each other, where K is a user-prespecified integer. More specifically, the clustering (·;) assigns class labels (X_i;) to each observation, where Clustering procedures generally operate on a matrix of pairwise dissimilarities (or similarities) between the observations to be clustered, such as the Euclidean or Manhattan distance matrices [8]. A partitioning of the learning set can be produced directly by partitioning clustering methods (for example, k-means, partitioning around medoid (PAM), self-organizing maps (SOM)) or by hierarchical clustering methods, by 'cutting' the dendrogram to obtain K 'branches' or clusters. Important issues, which will only be addressed briefly in this article, include: the selection of observational units, the selection of variables for defining the groupings, the transformation and standardization of variables, the choice of a similarity or dissimilarity measure, and the choice of a clustering method [9]. Our main concern here is to estimate the number of clusters K.

When a clustering algorithm is applied to a set of observations, a partition of the data is returned whether or not the data show a true clustering structure, that is, whether or not K = 1. This fact causes no problems if clustering is done to obtain a practical grouping of the given set of objects, as for organizational or visualization purposes (for example, hierarchical clustering for displaying large gene-expression data matrices as in Eisen et al. [10]). However, if interest lies primarily in the recognition of an unknown classification of the data, an artificial clustering is not satisfactory, and clusters resulting from the algorithm must be investigated for their relevance and reproducibility. This task can be carried out by descriptive and graphical exploratory methods, or by relying on probabilistic models and suitable statistical significance tests (for example [11,12]).

We argue here that validating the results of a clustering procedure can be done effectively by focusing on prediction accuracy. Once new classes are identified and class labels are assigned to the observations, the next step is often to build a classifier for predicting the class of future observations. The reproducibility or predictability of cluster assignments becomes very important in this context, and therefore provides a motivation for using ideas from supervised learning in an unsupervised setting. Resampling methods such as bagging [13] and boosting [14,15] have been applied successfully in the field of supervised learning to improve prediction accuracy. We propose here a novel resampling method, Clest, which combines ideas from discriminant and cluster analysis for estimating the number of clusters in a dataset. Although the proposed resampling methods are applicable to general clustering problems and procedures, particular attention is given to the clustering of tumors on the basis of gene-expression data using the partitioning around medoids (PAM) procedure (see below).

Partitioning around medoids

The new Clest procedure is demonstrated using the PAM method of Kaufman and Rousseeuw [16]. As implemented in the cluster package in R and S-Plus, the PAM function takes as its arguments a dissimilarity matrix (for example the Euclidean distance matrix as used here) and a prespecified number of clusters K. The PAM procedure is based on the search for K representative objects, or medoids, among the observations to be clustered. After finding a set of K medoids, K clusters are constructed by assigning each observation to the nearest medoid. The goal is to find K medoids that minimize the sum of the dissimilarities of the observations to their closest medoid. The algorithm first looks for a good initial set of medoids, then finds a local minimum for the objective function, that is, a solution such that there is no single switch of an observation with a medoid that will decrease the objective.

The PAM method tends to be more robust and computationally efficient than k-means. In addition, PAM provides a graphical display, the silhouette plot, which can be used to select the number of clusters and to assess how well individual observations are clustered. Let a_i denote the average dissimilarity between i and all other observations in the cluster to which i belongs. For any other cluster C, let d(i,C) denote the average dissimilarity of i to all objects of C and let b_i denote the smallest of these d(i,C). The silhouette width of observation i is sil_i = (b_i - a_i)/max(a_i,b_i) and the overall average silhouette width is simply the average of sil_i over all observations i, = ∑_isil_i/n. Intuitively, objects with large silhouette width sil_i, are well clustered, whereas those with small sil_i, tend to lie between clusters. Kaufman and Rousseeuw suggest estimating the number of clusters K by that which gives the largest average silhouette width, .

Existing methods for estimating the number of clusters in a dataset

Null hypothesis

Suppose that the maximum possible number of clusters in the data is set to M, 2 ≤ M ≤ n. One approach to estimating the number of clusters K is to look for , 1 < ≤ M, that provides the strongest significant evidence against the null hypothesis H₀ of K = 1, that is, 'no clusters' in the data. Two commonly used parametric null hypotheses are the unimodality hypothesis and the uniformity hypothesis.

Under the unimodality hypothesis, the data are thought to be a random sample from a multivariate normal distribution. This model typically gives a high probability of rejection of the null K = 1 if the data are sampled from a distribution with a lower kurtosis than the normal distribution, such as the uniform distribution [17].

The uniformity hypothesis, also referred to as random position hypothesis, states that the data are sampled from a uniform distribution in p-dimensional space [18,19,20]. Methods based on the uniformity hypothesis tend to be conservative, that is, lead to few rejections of the null hypothesis, when the data are sampled from a strongly unimodal distribution such as the normal distribution. In two or more dimensions, and depending on the test statistic, the results can be very sensitive to the region of support of the reference distribution [17].

For both types of hypotheses, evidence against the null hypothesis can be summarized formally under probability models for the data or more informally by using internal indices as described next.

Internal indices

Numerous methods have been proposed for testing the null hypothesis K = 1 and estimating the number of clusters in a dataset, however, none of them is completely satisfactory. Jain and Dubes [20] provide a general overview of such methods. The majority of existing approaches do not attempt to formally test the null hypothesis that K = 1, but rather look for the clustering structure under which a summary statistic of interest is optimal, being large or small depending on the statistic [21,22,23]. These statistics are typically functions of the within-clusters, and possibly between-clusters, sums of squares. They are referred to as internal indices, in the sense that they are computed from the same observations that are used to create the clustering. Consequently, the distribution of these indices is intractable. In particular, as clustering methods attempt to maximize the separation between clusters, the ordinary significance tests such as analysis of variance F-tests are not valid for testing differences between the clusters. Milli-gan and Cooper [12] conducted an extensive Monte Carlo evaluation of 30 internal indices. Other approaches include modeling the data using Gaussian mixtures and applying a Bayesian criterion to determine the number of components in the mixture [11]. A recent proposal of Tib-shirani et al. [24], called the gap statistic method, calibrates an internal index, such as the within-clusters sum of squares, against its expectation under a suitably defined null hypothesis (note that gap tests have been used in another context in cluster analysis by Bock [18] to test the null hypothesis of a 'homogeneous' population against the alternative of 'heterogeneity'). Tibshirani et al. carried out a comparative Monte Carlo study of the gap statistic and several of the internal indices that showed a better performance in the study of Milligan and Cooper [12]. These internal indices and the gap statistic are described in more detail below.

For a given partition of the learning set into 1 ≤ k ≤ M clusters, define B_k and W_k to be the p × p matrices of between and within k-clusters sums of squares and cross-products [8]. Note that B₁ is not defined. The following six internal indices are commonly used to estimate the number of clusters in a dataset.

sil: Kaufman and Rousseeuw [16] suggest selecting the number of clusters k ≥ 2 which gives the largest average silhouette width, _k. Silhouette widths were defined above with the clustering procedure PAM.

ch: Calinski and Harabasz [21]. For each number of clusters k ≥ 2, define the index

where tr denotes the trace of a matrix, that is, the sum of the diagonal entries. The estimated number of clusters is argmax_{k ≥ 2}ch_k.

kl: Krzanowski and Lai [23]. For each number of clusters k ≥ 2, define the indices

diff_k = (k - 1)^2/p trW_k-1 - k^2/p trW_k and

kl_k = |diff_k|/|diff_k+1|.

The estimated number of clusters is argmax_{k ≥ 2}kl_k.

hart: Hartigan [25]. For each number of clusters k ≥ 1, define the index

The estimated number of clusters is the smallest k ≥ 1 such that hart_k ≤ 10.

gap or gapPC: Tibshirani et al. [24]. This method compares an observed internal index, such as the within-clusters sum of squares, to its expectation under a reference null distribution as follows. For each number of clusters k ≥ 1, compute the within-clusters sum of squares trW_k. Generate B (here B = 10) reference datasets under the null distribution and apply the clustering algorithm to each, calculating the within-clusters sums of squares trW_k¹, ..., trW_k^B. Compute the estimated gap statistic

and the standard deviation sd_k of log trW_k^b, 1 ≤ b ≤ B. Let . The estimated number of clusters is the smallest k ≥ 1 such that , where k^* = argmax_k≥1gap_k.

Tibshirani et al. [24] chose the uniformity hypothesis to create a reference null distribution and considered two approaches for constructing the region of support of the distribution. In the first approach, the sampling window for the jth variable, 1 ≤ j ≤ p, is the range of the observed values for that variable. In the second approach, following Sarle [17], the variables are sampled from a uniform distribution over a box aligned with the principal components of the centered design matrix (that is, the columns of X are first set to have mean 0 and the singular value decomposition of X is computed). The new design matrix is then back-transformed to obtain a reference dataset. Whereas the first approach has the advantage of simplicity, the second takes into account the shape of the data distribution. Note that in both approaches the variables are sampled independently. The version of the gap method that uses the original variables to construct the region of support is referred to as gap and the second version as gapPC, where 'PC' stands for principal components.

Note that of the above methods, only hart, gap, and gapPC allow the estimation of only one cluster in the data, that is,

= 1.

External indices

The term 'validation of a clustering procedure' usually refers to the ability of a given method to recover the true clustering structure in a dataset. There have been several attempts to assess validity on theoretical grounds [18,25]; however, such approaches turn out to be of little applicability in the context of high-dimensional complex datasets. In many validation studies, clustering methods are evaluated on their performance on empirical datasets with a priori known cluster labels [25] or, more commonly, on simulation studies where true cluster labels are known. To assess the ability of a clustering procedure to recover true cluster labels it is necessary to define a measure of agreement between two partitions; the first partition being the a priori known clustering structure of the data, and the second partition resulting from the clustering procedure. In the clustering literature, measures of agreement between partitions are referred to as external indices; several such indices are reviewed next.

Consider two partitions of n objects x₁, ..., x_n: the R-class partition = {u₁, ..., u_R} and the C-class partition = {v₁, ...,v_C}. External indices of partitional agreement can be expressed in terms of a contingency table (Table 1), with entry n_ij denoting the number of objects that are both in clusters u_i and v_j, i = 1,...,R, j = 1,...,C [20]. Let and denote the row and column sums of the contingency table, respectively, and let .

Table 1 Contingency table for two partitions of n objects

Full size table

The following indices can then be used.

1.
Rand: Rand [26]

2.
Jaccard: Jain and Dubes [20]

3.
FM: Fowlkes and Mallows [27]

Note that Rand and FM are linear functions of Z, and hence are linear functions of one another, conditional on the row and column sums in Table 1. If the row and column sums in Table 1 are fixed, but the partitions are selected at random; that is, if there is independence in the table, the hypergeometric distribution can be applied to determine the expected value of quantities such as Z. In particular

An external index S is often standardized in such a way that its expected value is 0 when the partitions are selected at random and 1 when they match perfectly. This amounts to computing a standardized external index

where S_max is the maximum value of the statistic S and E(S) is the expected value of S when partitions are selected at random. Accordingly, an often used correction for the Rand statistic is

The significance of an observed external index is usually assessed under the assumption that the two partitions to be compared are independent. This assumption does not hold for the resampling methods described in the following section, since the same data are used to produce the two partitions. Nevertheless, external indices are convenient tools for comparing two clusterings, and are used in the new resampling method Clest. In this context, one should think of these indices as internal rather than external measures.

Results

Clest, a prediction-based resampling method for estimating the number of clusters

We propose a new prediction-based resampling method, Clest, for estimating the number of clusters, if any, in a dataset. The idea behind Clest is very intuitive if one is concerned with reproducibility or predictability of cluster assignments.

It is proposed to estimate the number of clusters K by repeatedly randomly dividing the original dataset into two non-overlapping sets, a learning set ^b and a test set . For each iteration and for each number of clusters k, a clustering (·;^b) of the learning set ^b is obtained and a predictor C(·;^b) is built using the class labels from the clustering. The predictor C(·;^b) is then applied to the test set and the predicted labels are compared to those produced by applying the clustering procedure to the test set, using one of the external indices (or similarity statistics) described in the Background section. The number of clusters is estimated by comparing the observed similarity statistic for each k to its expected value under a suitable null distribution with K = 1. The estimated number of clusters is defined to be the corresponding to the largest significant evidence against the null hypothesis of K = 1.

An early version of this approach was introduced by Breckenridge [28] under the name of replication analysis and was designed to evaluate the stability of a clustering. In the original replication analysis, the number of clusters k is fixed, and the data are randomly divided into two samples. A clustering procedure partitions both samples into k clusters, and the centroids of the clusters of the first sample are computed. A second set of labels is assigned to the observations in the second sample by assigning to each observation the cluster label of the closest centroid from the first sample. Finally, an external index is used to assess the agreement between the two partitions of the second sample. This measure reflects the stability of the clustering structure. The Clest procedure proposed here generalizes the work of Breckenridge [28].

Clest procedure for estimating the number of clusters in a dataset

Denote the maximum possible number of clusters by M, 2 ≤ M ≤ n. For each number of clusters k, 2 ≤ k ≤ M, perform steps 1-4.

1.
Repeat the following B times:
1. (a)
  Randomly split the original learning set
  into two non-overlapping sets, a learning set ^b and a test set .
2. (b)
  Apply a clustering procedure
  to the learning set ^b to obtain a partition (·;^b).
3. (c)
  Build a classifier C(·;^b) using the learning set ^b and its cluster labels.
4. (d)
  Apply the resulting classifier to the test set
  .
5. (e)
  Apply the clustering procedure
  to the test set to obtain a partition (·;).
6. (f)
  Compute an external index s_k,b comparing the two sets of labels for , namely the labels obtained by clustering and prediction.
2.
Let t_k = median(s_k,1, ..., s_k,B) denote the observed similarity statistic for the k-cluster partition of the data.
3.
Generate B₀ datasets under a suitable null hypothesis. For each reference dataset, repeat the procedure described in steps 1 and 2 above, to obtain B₀ similarity statistics t_k,1, ..., t_k,Bo.
4.
Let
denote the average of these B₀ statistics, , and let p_k denote the proportion of the t_k,b, 1 ≤ b ≤ B₀, that are at least as large as the observed statistic t_k, that is, the p-value for t_k. Finally, let d_k = t_k - denote the difference between the observed similarity statistic and its estimated expected value under the null hypothesis of K = 1.

Define the set K as

K^- = {2 ≤ k ≤ M : p_k ≤ p_max, d_k ≥ d_min},

where p_max and d_min are preset thresholds (see Parameters of the Clest procedure section below). If this set is empty, estimate the number of clusters as = 1. Otherwise, let = argmax_k_K - d_k, that is, take the number of clusters that corresponds to the largest significant difference statistic d_k.

Parameters of the Clest procedure

In this paper, the following decisions are made regarding the different parameters for the Clest procedure (see summary in Table 2).

Table 2 Parameters for Clest

Full size table

Clustering procedure: partitioning around medoids (PAM)

The PAM clustering procedure of Kaufman and Rousseeuw [16], implemented in the cluster package in R and S-Plus, was used to cluster observations based on the Euclidean distance metric (see Background).

Classifier: diagonal linear discriminant analysis (DLDA)

For multivariate Gaussian class conditional densities, that is, for x|y = k ~ N(_k,Σ_k), the maximum likelihood (ML) discriminant rule (or Bayes rule with uniform class priors) predicts the class of an observation x by that which gives the largest likelihood to x, that is,

When the class densities have the same diagonal covariance matrix

, the discriminant rule is linear and given by

For the corresponding sample ML discriminant rules, the population mean vectors and covariance matrices are estimated from a learning set by the sample mean vectors and covariance matrices, respectively:

and . For the constant covariance matrix case, the pooled estimate of the common covariance matrix is used: , where n_k denotes the number of observations in class k and n is the total sample size. DLDA is a very simple classifier but it has been shown to perform well in complex situations, in particular, in an extensive study of discrimination methods for the classification of tumors using gene-expression data [29]. DLDA is also known as naive Bayes classification.

Reference null distribution

The reference datasets are generated under the uniformity hypothesis as in the gap statistic method (see Background).

External index

All the external indices described in Background were considered. The FM index [27] was found to be superior to the other indices when reference datasets are generated under the uniformity hypothesis (data not shown).

Threshold parameters, p_max and d_min

The choices p_max = 0.05 and d_min = 0.05 are ad hoc and can probably be improved upon. Nevertheless, this rule gives a satisfactory performance and is used in the absence of a better choice.

Number of iterations and reference datasets

Here we used B = B₀ = 20. In general, the Clest procedure is robust to the choice of B and B₀ (data not shown).

Comparison of procedures on simulated data

The new procedure Clest was compared to six existing methods presented in Background using data simulated from the models described in Materials and methods. Figure 1 displays bar plots for the percentage of simulations for which a given method correctly recovered the number of clusters for each of the eight models. Table 3 provides a more detailed account of the simulation results for each procedure. It can be seen that Clest gave uniformly good results over the range of models, its worst performance being for Model 7 with two overlapping clusters. The rest of the methods failed for at least one of the eight models considered. The gap procedure failed twice (Models 5 and 6) and gapPC failed once (Model 6). Neither gap nor gapPC were able to identify the presence of the two clusters for Model 6, which is a model with two drawn-out clusters and seven noise variables with varying variances. Both gap and gapPC consistently estimated one cluster for this model, perhaps because both methods are based on the within-clusters sums of squares and consequently are more affected by the variables with larger variances. In a majority of the simulations from Model 7, Clest, gap, and gapPC failed to distinguish between one and two clusters, while the simple hart index performed well. The rest of the procedures do not have, by definition, the ability to estimate one cluster and hence generally identified the two clusters. Interestingly, for Model 8 with three overlapping clusters, sil and ch performed poorly, choosing two clusters in a majority of the simulations, while hart and Clest showed the best performance. Overall, most methods tended to underestimate more often than they overestimated the number of clusters, but the situation was reversed for hart and kl. For Model 1 it is only fair to compare Clest gap, gapPC, and hart, as the other methods only estimate ≥ 2.

Table 3 Estimating the number of clusters in simulated data

Full size table

In summary, for the simulation models considered here, Clest was the most robust and accurate, whereas hart performed worst. gapPC was better than gap and the rest of the methods performed similarly.

For a given model, it is of interest to consider the median value of the statistics used by each method to estimate the number of clusters. For each number of clusters k, the plots of the median values, over the 50 simulated datasets, of the Clest d_k-statistic, gapPC_k, and _k statistics are shown in Figures 2, 3 and 4, respectively. The d-statistic does not generally have local maxima except for Model 5. There, a local maximum appears at K = 4 clusters, but the global maximum occurs at K = 2. It can be seen that the ability of Clest to distinguish between one and two clusters is very low for Model 7; the median of the d₂ values is less than the significance cut-off d_min used in the Clest procedure. Indeed, the results in Table 3 show that Clest identified two clusters for only 30% of the datasets simulated from Model 7. The figures suggest that for the majority of the models, the global maximum of the median d-statistic is more pronounced than the global maxima of the median gapPC_k and _k statistics, respectively. This again suggests good robustness and accuracy properties for the Clest method.

Comparison of procedures on microarray data

The new Clest method was also evaluated using gene-expression data from the four cancer microarray studies described in Materials and methods and summarized in Table 4. Recall that mRNA samples in the lymphoma, leukemia, and NCI60 datasets were assigned class labels from the laboratory analyses of the tumor samples or from a priori knowledge of the cell lines. For the melanoma dataset, tumor class labels were obtained from the statistical analysis described in Bittner et al. [30]. In the discussion that follows, these class labels are treated as known. The six methods described in Background and Clest were applied to estimate the number of clusters for each of the four microarray datasets; the results are presented in Table 5.

Table 4 Description of microarray datasets

Full size table

Table 5 Estimating the number of clusters from microarray data

Full size table

The methods Clest and sil correctly estimated the presumed number of classes for all but the NCI60 dataset, where both methods identified three clusters only. The gap and gapPC methods overestimated the number of clusters for all datasets, with the exception of gapPC, which identified eight clusters for the NCI60 dataset. The ch method estimated two clusters for each of the four datasets, whereas kl and hart identified four classes for the lymphoma dataset.

For Clest, gapPC, and sil, we further investigated how the strength of the evidence for the estimated number of clusters varied between datasets. Figure 5 displays plots of the d_k, gapPC_k, and _k statistics versus the number of clusters k. Error bars for d_k and gapPC_k are based on the standard deviations of t_k and log trW_k under their respective null distributions. Whereas the evidence for the existence of clusters is very strong for the lymphoma, leukemia, and NCI60 datasets, the evidence for the two clusters in the melanoma dataset is much weaker. In particular, for Clest, the maximum value of the d_k statistic barely reaches the d_min threshold of 0.05. For the leukemia dataset, the d_k statistic clearly peaks at k = 3 clusters and drops off abruptly; for the lymphoma and NCI60 datasets the decrease is more gradual. Note that according to Clest there was not enough evidence to identify the two DLBCL subclasses for the lymphoma dataset. Alizadeh et al. [1] identified these subclasses using subject matter knowledge to select the genes for the clustering procedure; here the genes were selected in an unsupervised manner.

Discussion

Resampling methods such as bagging and boosting have been applied successfully in a supervised learning context to improve prediction accuracy. Here and in a related article [7], we have proposed resampling methods to address two main problems in cluster analysis: estimating the number of clusters, if any, in a dataset; improving and assessing the accuracy of a given clustering procedure. As the groups obtained from cluster analysis are often used later on for prediction purposes, the approaches to these two problems rely on and extend ideas from supervised learning. Although the methods are applicable to general clustering problems and procedures, particular attention is given to the clustering of tumors using gene-expression data. The performance of the proposed and existing methods was compared using simulated data and gene-expression data from four recently published cancer microarray studies.

To estimate the number of clusters in a dataset, we propose a prediction-based resampling method, Clest, which estimates the number of clusters K based on the reproducibility of cluster assignments. In comparative studies, Clest was generally found to be more accurate and robust than six existing methods. For the simulated datasets, Clest performed well across a wide range of models with varying numbers of overlapping and non-overlapping clusters, different numbers of variables and covariance matrix structures. Unlike methods based on between- or within-clusters sums of squares, the resampling method seems robust to the varying covariance structure of the variables.

For the microarray datasets, Clest and sil correctly estimated the number of tumor or cell-line clusters (as determined from a priori known or putative tumor and cell-line classes) for three out of the four datasets; the performance of other methods was significantly worse. We focus here on the clustering of tumor mRNA samples using gene-expression data. Once tumor classes are specified, an important next step would be the identification of marker genes that characterize these different tumor classes. A related question, which we have not considered here, is the 'transpose' clustering problem; that is, the clustering of genes that have similar expression levels across biological samples. One could then investigate the clusters for the presence of shared regulatory motifs among the genes [31]. This could lead to the identification of genes that are not only coexpressed but are also under similar regulatory control. Joint analysis of transcript level and sequence data should lead to greater biological insight into the molecular characterization of tumors.

A number of decisions were made regarding the different parameters of the Clest procedure. The clustering procedure PAM was used in the comparison; however, one should keep in mind that different clustering procedures can generate different partitions of the same data, possibly leading to different inferences about the number of clusters. In addition, the clustering (PAM) and prediction methods (DLDA or naive Bayes) considered in this article focus on similar features of the data, namely, the distance of the observations from cluster 'centers'. More work is needed to investigate the robustness of Clest to these choices. In particular, it would be interesting to consider prediction and clustering methods that focus on different aspects of the data (for example, classification trees instead of DLDA). Although it may seem that having a classifier as a parameter of the Clest procedure creates more room for error, we have found that this is not the case in practice. When the classifier in Clest performs poorly, other methods for estimating the number of clusters also perform poorly. Another important choice in the Clest procedure is the reference null distribution used to calibrate the observed similarity statistics t_k for different numbers of clusters. The uniformity hypothesis was used here; a natural alternative would be to consider random permutations of the variables, that is, permutations of the entries of the design matrix within columns. In Clest, the observed similarity statistics t_k are compared across numbers of clusters k by considering their distance from their estimated expected value under the null distribution. A more sensitive calibration may be achieved by taking scale into account, that is, by dividing the difference statistic d_k by the standard deviation of t_k under the null distribution, or even by considering p-values p_k for t_k. We briefly considered these refinements and found that on their own they did not allow good discrimination between the different ks. The Clest method does, however, use the idea of p-value in combination with the differences d_k, as it imposes an upper limit on the p-value p_k. Finally, the choice of cut-off parameters d_min and p_max was rather ad hoc and could be fine tuned.

We have not considered model-based methods, such as the Bayesian approach of Fraley and Raftery [11] or the mixture-model approach of McLachlan et al. [32]. Another issue only briefly addressed here is the selection of variables on which to base the clusterings. For the microarray datasets, genes were selected on the basis of the variance of their expression levels across samples, and it was found that the clusterings were fairly robust to the number of genes.

Resampling methods are promising tools for addressing various problems in cluster analysis. Ben-Hur et al. [33] have recently proposed a stability-based method for estimating the number of clusters, where stability is characterized by the distribution of pairwise similarities between clusterings obtained from subsamples of the data. It would be interesting to relate the approach of Ben-Hur et al. and Clest. Elsewhere, we proposed two bagged clustering methods for improving and assessing the accuracy of a given partitioning clustering procedure [7]. There, the bootstrap is used to generate and aggregate multiple clusterings and to assess the confidence of cluster assignments for individual observations. Leisch [34] proposed a bagged clustering method which is a combination of partitioning and hierarchical methods. A partitioning method is applied to bootstrap learning sets and the resulting partitions are combined by performing hierarchical clustering of the cluster centers. This method is similar in spirit to our two new bagging procedures [7].

Conclusions

Focusing on prediction accuracy in conjunction with resampling produces accurate and robust estimates of the number of clusters. As reproducibility of the cluster assignments is an integral part of the Clest method, the clustering results can be used reliably for building a classifier to predict the class of future observations. In addition, the procedure is robust to the covariance structure among variables.

Materials and methods

Simulation models

Procedures for estimating the number of clusters in a dataset were evaluated using simulated data from a variety of models, including those considered by Tibshirani et al. [24]. The models used for comparison contain different numbers of overlapping and non-overlapping clusters, different numbers of variables, and a wide range of covariance matrix structures. In addition, a variable number of irrelevant or 'noise' variables are included in the models. A noise variable is a variable whose distribution does not depend on the cluster label, and such variables are added to obscure the underlying clustering structure to be recovered.

Model 1. One cluster in 10 dimensions, n = 200 observations are simulated from the uniform distribution over the unit hypercube in p = 10 dimensions.

Model 2. Three clusters in two dimensions. The observations in each of the three clusters are independent bivariate normal random variables with means (0,0), (0,5), and (5,-3), respectively, and identity covariance matrix. There are 25, 25, and 50 observations in each of the 3 clusters, respectively.

Model 3. Four clusters in 10 dimensions, 7 noise variables. Each cluster is randomly chosen to have 25 or 50 observations, and the observations in a given cluster are independently drawn from a multivariate normal distribution with identity covariance matrix. For each cluster, the cluster means for the first three variables are randomly chosen from a N(o₃,25I₃) distribution, where o_p denotes a 1 × p vector of zeros and I_p denotes the p × p identity matrix. The means for the remaining seven variables are 0. Any simulation where the Euclidean distance between the two closest observations belonging to different clusters is less than 1 is discarded.

Model 4. Four clusters in 10 dimensions. Each cluster is randomly chosen to contain 25 or 50 observations, with means randomly chosen as N(o₁₀, 3.6I₁₀). The observations in a given cluster are independently drawn from a normal distribution with identity covariance matrix and appropriate mean vector. Any simulation where the Euclidean distance between the two closest observations belonging to different clusters is less than 1 is discarded.

Model 5. Two elongated clusters in three dimensions. Cluster 1 contains 100 observations generated as follows. Set x₁ = x₂ = x₃ = t, with t taking on equally spaced values from -0.5 to 0.5. Gaussian noise with standard deviation of 0.1 is then added to each variable. Cluster 2 is generated in the same way except that the value 10 is added to each variable. This results in two elongated clusters, stretching out along the main diagonal of a three-dimensional cube, with 100 observations each.

Model 6. Two elongated clusters in 10 dimensions, 7 noise variables. The clusters are generated as in Model 5, but, in addition, seven noise variables are simulated independently from a normal distribution with mean 0 and variance v² for the vth variable, 4 ≤ v ≤ 10.

Model 7. Two overlapping clusters in 10 dimensions, 9 noise variables. Each cluster contains 50 observations. The first variable in each of the two clusters is normally distributed with mean 0 and 2.5, respectively, and with variance 1. The remaining nine variables are simulated from the N(o₉,I₉) distribution (independently of the first variable).

Model 8. Three overlapping clusters in 13 dimensions, 10 noise variables. Each cluster contains 50 observations. The first three variables have a multivariate normal distribution with mean vectors (0,0,0), (2,-2,2), and (-2,2,-2), respectively, and covariance matrix Σ, where σ_ij = 1, 1 ≤ i ≤ 3, and σ_ij = 0.5, 1 ≤ i ≠ j ≤ 3. The remaining 10 variables are simulated independently from the N(o₁₀,I₁₀) distribution.

Note that Models 1, 2, 4, and 5 were considered in Tibshirani et al. [24]. Model 3 is similar to the third model in [24], with the addition of seven noise variables. Model 6 is the same as Model 5, with the addition of seven noise variables.

Fifty datasets were simulated from each model and the methods described in the Background and Results sections were applied to estimate the number of clusters in the resulting datasets. We are primarily interested in comparing the percentage of simulations for which each procedure recovers the correct number of clusters, as this quantity reflects the accuracy of the procedure. However, for the purpose of future applications, it is useful to also know whether a method tends to underestimate or overestimate the true number of clusters. Hence, the full distribution of the number of clusters estimated by each method is presented in Table 3. Note that only the methods Clest, gap, gapPC and hart have the capability to identify one cluster in the data.

Microarray data

The new Clest procedure and existing methods described in Background were applied to gene-expression data from four recently published cancer microarray studies: the lymphoma dataset of Alizadeh et al. [1], the leukemia (ALL/AML) dataset of Golub et al. [3], the 60 cancer cell line (NCI60) dataset of Ross et al. [6], and the melanoma dataset of Bittner et al. [30] (see summary in Table 4). Note that the expression levels are, in general, highly processed data: the raw data in a microarray experiment consist of image files, and important pre-processing steps include image analysis of the scanned images and normalization. Because we chose to use publicly available datasets, most of these decisions were beyond our control, and one should bear in mind that different pre-processing decisions could have a large impact on the measured expression levels [35,36].

Lymphoma

This dataset comes from a study of gene expression in the three most prevalent adult lymphoid malignancies: B-cell chronic lymphocytic leukemia (B-CLL), follicular lymphoma (FL), and diffuse large B-cell lymphoma (DLBCL) (see [1,37] for a detailed description of the experiments). Gene-expression levels were measured using a specialized cDNA microarray, the Lymphochip, containing genes that are preferentially expressed in lymphoid cells or which are of known immunological or oncological importance. In each hybridization, fluorescent cDNA targets were prepared from a tumor mRNA sample (red-fluorescent dye, Cy5) and a reference mRNA sample derived from a pool of nine different lymphoma cell lines (green-fluorescent dye, Cy3). The cell lines in the common reference pool were chosen to represent diverse expression patterns, so that most spots on the array would exhibit a non-zero signal in the Cy3 channel. This study produced gene-expression data for p = 4,682 genes in n = 81 mRNA samples. The tumor mRNA samples consist of 29 cases of B-CLL, 9 cases of FL, and 43 cases of DLBCL. Alizadeh et al. [1] further showed that the DLBCL class is heterogeneous and comprises two distinct subclasses of tumors with different clinical behaviors. The gene-expression data are summarized by an 81 × 4,682 matrix X = (x_ij), where x_ij denotes the base-2 logarithm of the Cy5/Cy3 background-corrected and normalized fluorescence intensity ratio for gene j in lymphoma sample i. The mean percentage of missing observations per array is 6.6% and missing data were inferred as outlined below. The data were standardized as described below.

Leukemia

The leukemia dataset is described in [3] and available at [38]. This dataset comes from a study of gene expression in two types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene-expression levels were measured using Affymetrix high-density oligonu-cleotide arrays containing p = 6,817 human genes. The data comprise 47 cases of ALL (38 ALL B-cell and 9 ALL T-cell) and 25 cases of AML. Following Golub et al. (P. Tamayo, personal communication), three pre-processing steps were applied to the normalized matrix of intensity values available on the website (after pooling the 38 mRNA samples from the learning set and the 34 mRNA samples from the test set). First, a floor of 100 and ceiling of 16,000 was set; second, the data were filtered to exclude genes with max/min ≤ 5 or (max - min) ≤ 500, where max and min refer respectively to the maximum and minimum intensities for a particular gene across the 72 mRNA samples; and third, the data were transformed to base 10 logarithms. The data are then summarized by a 72 × 3,571 matrix X = (x_ij), where x_ij denotes the expression level for gene j in mRNA sample i. There are no missing values and the data were standardized as described below. Note that this standardization differs from the one described in Golub et al. [3].

NCI60

In this study, cDNA microarrays were used to examine the variation in gene expression among the 60 cell lines from the National Cancer Institute's (NCI60) anti-cancer drug screen [6,39]. The cell lines were derived from tumors with different sites of origin: 7 breast, 6 central nervous system (CNS), 7 colon, 6 leukemia, 8 melanoma, 9 non-small-cell-lung-carcinoma (NSCLC), 6 ovarian, 2 prostate, 8 renal, and 1 unknown (ADR-RES). Gene expression was studied using microarrays with 9,703 spotted DNA sequences. In each hybridization, fluorescent cDNA targets were prepared from a cell-line mRNA sample (red-fluorescent dye, Cy5) and a reference mRNA sample obtained by pooling equal mixtures of mRNA from 12 of the cell lines (green-fluorescent dye, Cy3). To investigate the reproducibility of the entire experimental procedure (cell culture, mRNA isolation, labeling, hybridization, scanning, and so on), a leukemia (K562) and a breast cancer (MCF7) cell line were analyzed by three independent microarray experiments. Ross et al. screened out genes with missing data in more than two arrays. In addition, because of their small class size, the two prostate cell lines and the unknown cell line (ADR-RES) were excluded from our analysis. The data are summarized by a 61 × 5,244 matrix X = (x_ij), where x_ij denotes the base-2 logarithm of the Cy5/Cy3 background-corrected and normalized fluorescence intensity ratio for gene j in cell line i. The mean percentage of missing observations per array is 3.3% and missing data were inferred as outlined below. The data were standardized as described below.

Melanoma

The melanoma dataset is described in the recent paper of Bittner et al. [30] and is available at [40]. There are 31 melanoma samples and 7 control samples. Gene-expression levels were measured using cDNA microarrays with 8,150 probe sequences, representing 6,971 unique genes. In each hybridization, fluorescent cDNA targets were prepared from a melanoma or control mRNA sample (red-fluorescent dye, Cy5) and a common reference mRNA sample (green-fluorescent dye, Cy3). The following pre-processing steps were applied by Bittner et al. First, a gene was excluded from the analysis if its average mean intensity above background for the least intense signal (Cy3 or Cy5) across all experiments was ≤ 2,000 or its average spot size across all experiments was ≤ 30 pixels; and second, a floor and ceiling of 0.02 and 50, respectively, were applied to the individual intensity log-ratios. This initial screening resulted in a dataset of 3,613 genes (see Supplemental Information to [30], document II, page 2). Finally, Bittner et al. did not include the seven control samples in their analysis. The data are summarized by a 31 × 3,613 matrix X = (x_ij), where x_ij denotes the base-2 logarithm of the Cy5/Cy3 background-corrected and normalized fluorescence intensity ratio for gene j in mRNA sample i. There were no a priori known classes for this dataset, but the analysis of Bittner et al. suggests that two classes may be present in the data, with observations in one of the classes (Group A in their figures) being more tightly clustered. There were no missing values and the data were standardized as described below. Note that this standardization is slightly different from the one described in [30].

Imputation of missing data

For the lymphoma and NCI60 datasets, each array contains a number of genes with fluorescence-intensity measurements that were flagged by the experimenter and recorded as missing data points. Missing data were imputed by a simple k-nearest-neighbor algorithm, in which the neighbors are the genes and the distance between neighbors is based on the correlation between their gene-expression levels across arrays. For each gene with missing data: first compute its correlation with all other p - 1 genes, and then, for each missing array, identify the k nearest genes having data for this array and infer the missing entry from the average of the corresponding entries for the k neighbors. A value of k = 5 neighbors was used for the lymphoma and NCI60 datasets. For a detailed study of methods for filling in missing values in microarray experiments, see [41], which suggests that a nearest-neighbor approach provides accurate and robust estimates of missing values.

Standardization

The gene-expression data were standardized so that the observations (arrays) have mean 0 and variance 1 across variables (genes). Standardizing the data in this fashion achieves a location and scale normalization of the different arrays. In a study of normalization methods, we have found scale adjustment to be desirable in some cases, to prevent the expression levels in one particular array from dominating the average expression levels across arrays [36]. Furthermore, this standardization is consistent with the common practice in microarray experiments of using the correlation between the gene-expression profiles of two mRNA samples to measure their similarity [1,4,6]. In practice, however, we recommend general adaptive and robust normalization methods which correct for intensity, spatial, and other types of dye biases using robust local regression [36].

Preliminary gene selection

Expression levels were monitored for thousands of genes in each of the four studies. However, the majority of the genes exhibit near-constant expression levels, as measured by the variance (or coefficient of variation) of the expression levels across tumor samples. Genes showing nearly constant expression levels are not likely to be useful for classification purposes; therefore, we chose to exclude low-variance genes from the clustering process.

Figure 6 displays for each dataset the individual gene variances divided by the maximum variance over all genes. All four variance curves show a sharp drop-off which gradually flattens. The plots are remarkably similar for all the datasets, with the melanoma dataset having the fastest drop-off. In this report, the p = 100 most variable genes were used to analyze the leukemia, lymphoma and melanoma datasets, and the p = 200 most variable genes were used for the NCI60 dataset as it contains more classes. Increasing the number of genes to p = 300-400 or decreasing the number of genes to p = 50 did not have much effect on the results (data not shown). One could also select genes based on a coefficient of variation filter.

Correlation matrices

The following is not part of the cluster analysis per se, but is an interesting side-step which may be predictive of the results of the forthcoming analysis. Recall that for the first three datasets, tumor classes were known a priori, and for the melanoma dataset two classes were inferred by Bittner et al. [30] through cluster analysis. For each dataset, images of the n × n correlation matrix for the n mRNA samples are displayed in Figures 7,8,9,10, with observations grouped according to their a priori known or putative classes. Note that if observations are highly correlated within classes, the correlation image in this representation should show bright red squares along the diagonal.