- Open Access
Model-based cluster analysis of microarray gene-expression data
© Pan et al., licensee BioMed Central Ltd 2002
- Received: 26 September 2001
- Accepted: 28 November 2001
- Published: 29 January 2002
Microarray technologies are emerging as a promising tool for genomic studies. The challenge now is how to analyze the resulting large amounts of data. Clustering techniques have been widely applied in analyzing microarray gene-expression data. However, normal mixture model-based cluster analysis has not been widely used for such data, although it has a solid probabilistic foundation. Here, we introduce and illustrate its use in detecting differentially expressed genes. In particular, we do not cluster gene-expression patterns but a summary statistic, the t-statistic.
The method is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middle-ear infection. Three clusters were found, two of which contain more than 95% genes with almost no altered gene-expression levels, whereas the third one has 30 genes with more or less differential gene-expression levels.
Our results indicate that model-based clustering of t-statistics (and possibly other summary statistics) can be a useful statistical tool to exploit differential gene expression for microarray data.
- Posterior Probability
- Akaike Information Criterion
- Otitis Medium
- Bayesian Information Criterion
- Normal Mixture
The pattern of genes expressed in a cell can provide important information about the cell state. DNA microarray technology can measure the expression of thousands of genes in a biological sample. DNA microarrays have been increasingly used in the last few years and have the potential to help advance our biological knowledge at a genomic scale [1,2]. In analyzing DNA microarray gene-expression data, a major role has been played by various cluster-analysis techniques, most notably by hierarchical clustering , K-means clustering  and self-organizing maps . These clustering techniques contribute significantly to our understanding of the underlying biological phenomena. A recent review of various methods is provided by Tibshirani et al. . However, many methods, including the three mentioned above, have some restrictions, one of which is their inability to determine the number of clusters. The difficulty may be related to the fact that in many methods there is no clear definition of what a cluster is in the first place. Furthermore, their clustering results may not be stable [7,8]. An important clustering technique that improves on and/or provides alternative solutions to these issues is model-based clustering (see, for example, ). It has a clear definition that a cluster is a subpopulation with a certain distribution, and several statistical methods can be applied to estimate the number of clusters. Some authors have considered its application to cluster gene-expression patterns [10,11,12].
Here we consider the use of model-based clustering in the context of detecting differentially expressed genes, which is to identify all the genes with altered expression under two experimental conditions (for example, normal cells versus cancer cells). We note that the goal here is different from that of clustering gene-expression patterns, as done by other researchers in using model-based clustering. In modeling differential expression levels of genes, it is natural to assume that genes are from two subpopulations, one with constant and another with changed expression levels. Hence, a two-component mixture is a reasonable model. This is the approach proposed by Lee et al. , where it is assumed that each of the two components has a normal (in the statistical sense) distribution. However, in general, each component does not necessarily have a normal distribution. It is well known that many distributions can be well approximated by a finite mixture of normal distributions. Hence, the normal mixture model-based clustering can be regarded as a more general and flexible approach along these lines and we pursue this approach here. In particular, we summarize a possible change of expression of a gene using a t-statistic, which automatically accounts for differential variations of expression levels across genes. Then we apply model-based clustering to these t-statistics to exploit which genes have differential expression levels. The methodology is illustrated with an application to a dataset containing the expression levels of 1,176 genes of normal rats and those with pneumococcal middle-ear infection.
Data and preprocessing
On the basis of the above observation, we calculate the following two-sample t-statistic for each gene as its measure of possible differential expression:
for i = 1, ...,1176. The numerator of y i is the difference of average gene-expression levels under the two conditions (infected versus control), whereas the denominator is the sample standard error of the numerator and serves to standardize the observed difference by penalizing those with large (and thus less reliable) variations. Previous studies have found evidence that genes may have differential variability of expression levels [15,16,17]. Note that although the t-statistic is constructed, we shall not conduct t-tests because there is no evidence to support the questionable normality assumption required by the t-test. We also do not carry out permutation or other nonparametric tests  because of the small sample size (that is, 2 + 4). This is also related with the fact that there exists the problem of multiple comparisons if we test gene by gene . Our goal here is to apply model-based cluster analysis to the preprocessed relative gene-expression levels y i , i = 1, ..., 1176, and see which genes will have relative levels far away from the majority.
Finite mixtures of distributions provide a flexible as well as rigorous approach to modeling various random phenomena (for example, ). For continuous data, such as gene-expression data, the use of normal components in the mixture distribution is natural. With a normal mixture model-based approach to clustering, it is assumed that the data to be clustered are from several subpopulations (or clusters or components) with distinguished normal distributions. That is, each data point y is taken to be a realization from a normal mixture distribution with the probability density function:
where φ (y;μ i ,V i ) denotes the normal density function with mean μ i and (co)variance matrix V i , and π i 's are mixing proportions. We use Φ g to represent all unknown parameters (π i , μ i , V i ): i = 1, ... g in a g-component (or g-cluster) mixture model.
The mixture model is typically fitted by maximum likelihood using the expectation-maximization (EM) algorithm . Given n observations y1, ..., y n , we want to maximize the log-likelihood
Suppose that at the kth iteration, the parameter estimates are π i (k)'s, μ i (k)'s and V i (k)'s. Then in the (k + 1)th iteration, the estimates are updated by
for i = 1, ..., g where
is the posterior probability that y j belongs to the ith component of the mixture, using the current parameter estimate for Φ g , for i = 1, ..., g and j = 1, ..., n.
One interesting but difficult problem in cluster analysis is to determine the number of components g. In contrast to many other approaches that fail to accomplish this goal, model-based clustering provides several useful and objective selection criteria, which have been used in other model selection problems. The best known are the Akaike Information Criterion (AIC)  and the Bayesian Information Criterion (BIC) :
where v g is the number of independent parameters in Φ g . In using the AIC or BIC, one first fits series of models with various values of g, then one picks up the g with the smallest AIC or BIC.
In many studies related to model selection, it is found that AIC may select too large a model whereas BIC may select too small a model. This phenomenon appears to hold in selecting g in the mixture analysis . Some other criteria have been studied but there does not seem to be a clear winner . Banfield and Raftery  proposed using approximate weight of evidence as an approximate Bayesian model selection criterion. Some empirical studies seem to favor the use of BIC . We feel that a combined use of AIC and BIC is helpful, at least in providing a range of reasonable values of g.
A different approach to selecting g is through hypothesis testing. This could be done through the use of the log-likelihood ratio test (LRT) to test for the null hypothesis H0: g = g0 against the alternative H1: g = g0 +1 for any given positive integer g0. The LRT statistic is 2 log L( 0+1) - 2 log L( 0), which, however, does not have the usual asymptotically chi-squared distribution as a result of violation of required regularity conditions (for example, the maximum likelihood estimate may lie in the boundary of its parameter space). McLachlan  proposed using the bootstrap to approximate the distribution of the LRT statistic under the null hypothesis. On the basis of the resulting p value, one can decide whether to reject H0.
McLachlan et al.  have implemented model-based clustering in a stand-alone Fortran program called EMMIX, which is freely available from the web . It supports all the functions we described above, including multiple start of the EM algorithm using random partition or K-mean clustering, calculation of the model selection criteria AIC and BIC, and the use of the bootstrap to test a given number of components g0. We will use EMMIX to analyze the gene-expression data described earlier.
The MCLUST software , implementing model-based clustering, is also freely available . It is designed to interface with the commercial statistical package S-Plus. For users familiar with S-Plus, it is convenient to take advantage of the power and flexibility of S-Plus. However, at the same time, it can have some serious restrictions on the size of the data being analyzed because of the overhead on CPU speed and memory induced by S-Plus.
Clustering results with various number of components g
The fitted mixture model is
f(y; ) = 0.042 x N(6.74, 77.07) + 0.510 × N(0.88, 5.56) + 0.448 × N(-0.31, 1.15).
Furthermore, the posterior probability can serve as a quantitative measurement of the strength of each gene being classified into each cluster. For instance, among 30 genes classified into the first cluster, there are respectively 17, 18, 20 and 21 genes with a posterior probability of being in the first cluster greater than 0.99, 0.95, 0.90 and 0.85. Hence, those 17 or 18 genes are likely to have expression levels significantly different from those of other majority genes. The posterior probability might also provide information about possible misclassifications. In addition to those classified into cluster 1, there might be other observations classified into the other two clusters but nevertheless with not too small probabilities of being classified into cluster 1. The lower right panel of Figure 4 shows six such observations, all belonging to cluster 2 but with probabilities of being in cluster 1 ranging from 0.30 to 0.48. These six genes show somewhat differential gene-expression levels, but the evidence is not strong and more experiments may be needed to verify this.
We hope we have shown that model-based clustering is a powerful method that is useful in analyzing gene-expression data. It is flexible as well as intuitively understandable. However, it does have some limitations. Although it provides posterior probabilities for classification results, in the context of detecting differentially expressed genes its use is more in the line of exploratory data analyses. For instance, in our example, we treat cluster 1 as representing genes with changed expression whereas clusters 2 and 3 consist of genes without expression changes. Although this treatment is reasonable, it is somewhat subjective and is debatable. Some new statistical approaches [31,32,33] are interesting alternatives that provide a more quantitative answer to detecting genes with altered expression, but they require replicates of spots or arrays. Model-based clustering is less restrictive and can be applied to data without replicates and to cluster (relative) gene-expression levels directly .
Three young pathogen-free Sprague-Dawley rats were inoculated with pneumococcus in phosphate-buffered saline (PBS) and served as the pneumococcus group. Three other rats inoculated with PBS served as controls. All animals were sacrificed on day 42 after inoculation. The bullae from each of the pneumococcus- or PBS-inoculated groups were pooled and submitted for mRNA purification. Purified mRNAs, [α -32P]dATP, dNTP mix and reverse transcriptase were incubated at 50°C for 25 min for the synthesis of radioactively labeled cDNA probes. The Atlas cDNA array membranes (Atlas rat 1.2 array, Clontech, CA) were hybridized with the cDNA probes and nonspecific binding washed away. Specific binding of cDNA probes with the membranes was scanned into a computer and the radioactive signal intensities of specific binding were quantitated with the OptiQant software (version 3.0, DeltaPackard, Boston, MA) and presented in digitalized light unit (DLU). The intensity level in DLU is the observed gene-expression level. As described earlier, the log-transformation was conducted on the intensity level in DLU, and the centering and scaling procedures were followed using the log-transformed data. The original data representing the intensity level (in DLU) for each gene from each of the six experiments are available from our website .
This research was partially supported by NIH grants.
- Brown P, Botstein D: Exploring the new world of the genome with DNA microarrays. Nat Genet. 1999, 21 (Suppl): 33-37.PubMedView ArticleGoogle Scholar
- Lander ES: Array of hope. Nat Genet. 1999, 21 (Suppl): 3-4.PubMedView ArticleGoogle Scholar
- Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.PubMedPubMed CentralView ArticleGoogle Scholar
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet. 1999, 22: 281-285. 10.1038/10343.PubMedView ArticleGoogle Scholar
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with self-organizing maps: methods and applications to hematopoietic differentiation. Proc Natl Acad Sci USA. 1999, 96: 2907-2912. 10.1073/pnas.96.6.2907.PubMedPubMed CentralView ArticleGoogle Scholar
- Tibshirani R, Hastie T, Eisen M, Ross G, Botstein D, Brown P: Clustering methods for the analysis of DNA microarray data. Technical Report, Department of Statistics, Stanford University,. 1999, [http://www-stat.stanford.edu/~tibs/research.html]Google Scholar
- Zhang K, Zhao H: Assessing reliability of gene clusters from gene expression data. Funct Integr Genomics. 2000, 1: 156-173. 10.1007/s101420000019.PubMedView ArticleGoogle Scholar
- Kerr MK, Churchill GA: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc Natl Acad Sci USA. 2001, 98: 8961-8965. 10.1073/pnas.161273698.PubMedPubMed CentralView ArticleGoogle Scholar
- McLachlan GL, Basford KE: Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker,. 1988Google Scholar
- Holmes I, Bruno WJ: Finding regulatory elements using joint likelihoods for sequence and expression profile data. Proc 8th Int Conf Intelligent Systems for Molecular Biology. Menlo Park, CA: AAAI Press,. 2000Google Scholar
- Barash Y, Friedman N: Context-specific Bayesian clustering for gene expression data. Proc Fifth Annual Int Conf Computational Biology. New York: Association for Computing Machinery,. 2001Google Scholar
- Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL: Model-based clustering and data transformations for gene expression data. Bioinformatics. 2001, 17: 977-987. 10.1093/bioinformatics/17.10.977.PubMedView ArticleGoogle Scholar
- Lee M-LT, Kuo FC, Whitmore GA, Sklar J: Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci USA. 2000, 97: 9834-9839. 10.1073/pnas.97.18.9834.PubMedPubMed CentralView ArticleGoogle Scholar
- Friemert C, Erfle V, Strauss G: Preparation of radiolabeled cDNA probes with high specific activity for rapid screening of gene expression. Methods Mol Cell Biol. 1998, 1: 143-153.Google Scholar
- Chen Y, Dougherty ER, Bittner ML: Ratio-based decisions and the quantitative analysis of cDNA microarray images. J Biomed Optics. 1997, 2: 364-367. 10.1117/1.429838.View ArticleGoogle Scholar
- Ideker T, Thorsson V, Siehel AF, Hood LE: Testing for differentially-expressed genes by maximum likelihood analysis of microarray data. J Comput Biol. 2000, 7: 805-817. 10.1089/10665270050514945.PubMedView ArticleGoogle Scholar
- Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW: On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol. 2001, 8: 37-52. 10.1089/106652701300099074.PubMedView ArticleGoogle Scholar
- Dudoit S, Yang YH, Callow MJ, Speed TP: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report, Statistics Department, University of California-Berkeley,. 2000Google Scholar
- Titteringto DM, Smith AFM, Makov UE: Statistical Analysis of Finite Mixture Distributions. New York:Wiley,. 1985Google Scholar
- Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc Ser B. 1977, 39: 1-38.Google Scholar
- Akaike H: Information theory and an extension of the maximum likelihood principle. In 2nd Int Symp Information Theory. Edited by Petrov BN, Csaki F. Budapest: Akademiai Kiado,. Edited by: . 1973, 267-281.Google Scholar
- Schwartz G: Estimating the dimensions of a model. Annls Statistics. 1978, 6: 461-464.View ArticleGoogle Scholar
- Biernacki C, Govaert G: Choosing models in model-based clustering and discriminant analysis. J Stat Comput Simulation. 1999, 64: 49-71.View ArticleGoogle Scholar
- Banfield JD, Raftery AE: Model-based Gaussian and non-Gaussian clustering. Biometrics. 1993, 49: 803-821.View ArticleGoogle Scholar
- Fraley C, Raftery AE: How many clusters? Which clustering methods? Answers via model-based cluster analysis. Computer J. 1998, 41: 578-588.View ArticleGoogle Scholar
- McLachlan GL: On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Appl Statistics. 1987, 36: 318-324.View ArticleGoogle Scholar
- McLachlan GL, Peel D, Basford KE, Adams P: Fitting of mixtures of normal and t-components. J Stat Software. 1999, 4: 2-[http://www.jstatsoft.org/v04/i02/]View ArticleGoogle Scholar
- EMMIX. [http://www.maths.uq.oz.au/~gjm/emmix/emmix.html]
- Fraley C, Raftery AE: MCLUST: Software for model-based cluster analysis. J Classification. 1999, 16: 297-306. 10.1007/s003579900058.View ArticleGoogle Scholar
- Model-based Clustering Software. [http://www.stat.washington.edu/fraley/mclust]
- Efron B, Tibshirani R, Goss V, Chu G: Microarrays and their use in a comparative experiment. Technical Report, Department of Statistics, Stanford University,. 2000, [http://www-stat.stanford.edu/~tibs/research.html]Google Scholar
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.PubMedPubMed CentralView ArticleGoogle Scholar
- Pan W, Lin J, Le C: A mixture model approach to detecting differentially expressed genes with microarray data. Technical Report 2001-011, Division of Biostatistics, University of Minnesota,. 2001, [http://www.biostat.umn.edu/cgi-bin/rrs?print+2001]Google Scholar
- Wei Pan website. [http://www.biostat.umn.edu/~weip]