Cluster-Rasch models for microarray gene expression data
- Hongzhe Li^{1}Email author and
- Fangxin Hong^{1}
https://doi.org/10.1186/gb-2001-2-8-research0031
© Li and Hong, licensee BioMed Central Ltd 2001
Received: 26 February 2001
Accepted: 19 June 2001
Published: 31 July 2001
Abstract
Background
We propose two different formulations of the Rasch statistical models to the problem of relating gene expression profiles to the phenotypes. One formulation allows us to investigate whether a cluster of genes with similar expression profiles is related to the observed phenotypes; this model can also be used for future prediction. The other formulation provides an alternative way of identifying genes that are over- or underexpressed from their expression levels in tissue or cell samples of a given tissue or cell type.
Results
We illustrate the methods on available datasets of a classification of acute leukemias and of 60 cancer cell lines. For tumor classification, the results are comparable to those previously obtained. For the cancer cell lines dataset, we found four clusters of genes that are related to drug response for many of the 90 drugs that we considered. In addition, for each type of cell line, we identified genes that are over- or underexpressed relative to other genes.
Conclusions
The cluster-Rasch model provides a probabilistic model for describing gene expression patterns across samples and can be used to relate gene expression profiles to phenotypes.
Background
Recently, DNA chip or microarray technology has been developed that allows researchers to measure the expression levels of thousands of genes simultaneously over different time points, different experimental conditions or different tissue samples. It is based on the hybridization of DNA or RNA molecules with a library of complementary strands fixed on a solid surface. Oligonucleotide chips contain thousands of features with gene-specific sequences about 25 bases long. These oligos are then hybridized with labeled probe derived from a given tissue or cell line. The resulting fluorescence intensity gives information about the abundance of the corresponding mRNA. This is the Affymetrix DNA chip technology. Alternatively, cDNA can be spotted on nylon filters or glass slides. Complex mRNA probes are reverse transcribed to cDNA and labeled with red or green fluorescent dyes. This technique is often called the spotted array or cDNA array. In both methods, thousands of mRNA concentrations can be measured in parallel, potentially revealing complex gene regulatory networks.
One important application of the microarray gene expression data in medicine is to study the relationship between tissue phenotypes and gene expression profiles on the whole-genome scale. The phenotype could be several different types of cancers [1,2,3], responses of cell lines to different chemical compounds [4], or time to tumor recurrence after treatment. For binary phenotypes such as two different types of cancers, the problem becomes the classification of patients' samples. It has been suggested that gene expression may provide the additional information needed to improve cancer classification and diagnosis [4]. For continuous phenotypes such as drug sensitivity, the problem of interest is to relate gene expression patterns to sensitivity to drugs and, therefore, aid in the process of drug discovery and provide a rationale for selection of therapy on the basis of the molecular characteristics of a patient's tumor.
From the statistical point of view, the challenge is that the microarray gene expression data are often measured with a great deal of noise, and that the sample size of tissues or cell lines, denoted by n, is usually very small compared to the number of genes in expression arrays, denoted by p. This results in the 'large p, small n' problem [5]. Most current approaches to dealing with this problem first select genes that can best separate tissues of different types by performing univariate analysis. The expression levels of these genes are then combined linearly in a weighted way to form compound covariates. These covariates are then used in the standard regression models for model fitting and prediction. West et al. [5] proposed a Bayesian binary regression approach using the singular value decomposition to first reduce the dimension of the variable space (p) to the number of samples (n). They called the resulting linear combination of the expression levels of all the genes the expression of the 'supergenes'. All these approaches reduce the variable space by making one or several linear combinations of the expression levels of some or all of the genes. Linear combination may not, however, be the best way of reducing the dimension of the variable space.
Another popular approach to analyzing gene expression data is to use clustering methods to simultaneously cluster both samples and genes in order to determine some clusters of genes that are mostly correlated with some clusters of samples. Examples of such an application include analysis of gene expression data and drug response for the 60 human cancer cell lines of the National Cancer Institute (NCI60 data) [4], and analysis of cancerous and normal colon tissues [2]. However, the clustering approach is purely exploratory and requires an external similarity measure. Methods that can be used to assess the significance of the clustering results are needed.
The Rasch model (RM) and its extensions [6,7] are an important staple of psychological research and are used in other fields such as sociology, educational testing and medicine. The idea of the RM is that one can indirectly infer a person's position on a latent trait from his/her responses to a set of well-chosen items. For example, the RM has been used to infer the quality of life of cancer patients from their answers to a well-designed questionnaire [8], or to measure disability from activities of daily living [9]. For these applications, data are usually given in a matrix, with rows being individuals and columns being responses to a set of items. Microarray gene expression data are also given in a transposable matrix form with rows being genes and columns being samples, and vice versa. The RM can therefore be used to explain the observed patterns over different columns. Here we propose two different formulations of the polytomous RM for analysis of microarray gene expression data. The first formulation treats samples as 'persons' and genes as 'items'. The idea is to infer several latent factors associated with a given sample on the basis of its expression profile over many genes. We combine a model-based clustering method [10] with the RM to define a small set of latent factors associated with samples. For a given sample, we assume that genes in the same cluster determine one latent factor associated with this sample, and use the RM to estimate this latent factor for each gene cluster. These latent factors are then used in a regression analysis of the observed phenotypes. The rational of this approach is that genes of similar function yield similar expression patterns in microarray hybridization experiments [11,12,13]. Co-regulated genes may share similar expression profiles, maybe involved in related functions or regulated by common regulatory elements [14]. Therefore, if genes are clustered together, it is impossible from a statistical point of view to differentiate one gene from the other. In this case, a better way of studying these genes is to treat them as a cluster. Consideration of genes in the same cluster can potentially reduce noise associated with a single gene.
The second formulation is to treat genes as 'persons' and samples as 'items'. The idea is to infer several latent factors associated with each gene based on its expression levels across samples from different tissue or cell types. This formulation provides simple summary statistics for genes based on their expression profiles over samples, and helps to identify genes that are more likely be over- or underexpressed within samples of the same type or between samples of different types. We first briefly review some key ideas of the polytomous RM and its estimation. We then present two different formulations of the RM for the gene expression data. Details involved in these formulations are given. We apply our proposed methods to the analysis of the leukemia dataset [1] and the NCI60 dataset [4] and conclude with discussion of our method.
Results
The Rasch model (RM)
The RM was originally proposed as an item-response theory model in the psychological test or attitude scale [6]. The idea is that the use of a test or scale presupposes that one can indirectly infer a person's position on a latent trait from his/her responses to a set of well-chosen items. Assume that we have I persons and J items. Let Z_{ ij } be the response of individual i to the item j, where the response can take one from m + 1 possible ordinal categories, 0,..., m. One version of the RM, which we use in this paper, called the partial credit model [15], assumes the probability of response h, as
Therefore, there is a total of Jm - 1 unconstraint item-specific parameters.
The item parameters can be estimated based on the conditional likelihood, given minimal sufficient statistics for the person parameters. For a given person, the minimal sufficient statistic is the sum of the category weights corresponding to the observed responses. After the β parameters are estimated, the person parameters can then be estimated by maximizing the likelihood function. Details on the conditional likelihood estimation of the item parameters can be found in Anderson [16].
Relating gene expression profiles to phenotypes
Typical microarray data consist of expression levels for a large number of genes on a relatively small number of samples. Let x_{ ij } be the gene expression level of the jth gene in the ith sample, for i = 1,..., n, and j = 1,..., p. In practice, n is usually much smaller than p. For spotted arrays, to moderate the influence of gene expression ratios above and below one, we may apply the natural log transform to all the red to green ratios [12]. Upregulated genes thus have a positive log expression ratio, whereas downregulated genes have a negative log expression ratio. Or x_{ ij } might be the expression level from an oligonucleotide array. In addition, for the ith sample, we have observed phenotype y_{ i }, which could be a binary indicator such as two different types of cancer, a continuous measurement such as drug-response activity or censored survival time such as time to tumor recurrence. To apply the RM to the gene expression data x_{ ij }, we first need to discretize the gene expression levels x_{ ij } into z_{ ij }, which takes value from 0,..., m, for i = 1,..., n, j = 1, ..., p. In practice, we can use the quantiles or the quantiles within quantiles as cut-off points for discretization. Because this approach uses only ranks rather than the actual expression levels, there may be slight loss of information. However, in return, we gain a valid analysis with robustness to the outliers.
Outline of the approach
Our goal is to infer several latent factors associated with each sample based on its gene expression profile, and relate these latent factors to the observed phenotypes. Using the terms of the RM, we treat each gene as an 'item', each tissue sample or cell line as a 'person', and treat the expression level as the response of a given tissue to a given gene. The unidimensional RM may not, however, hold for the complete set of genes generated by microarrays. Here we assume that genes with similar expressions determine one latent factor, and that the RM holds for each set of genes with similar expression profiles.
To identify genes with similar expression profiles over samples, we first use the model-based clustering method of Fraley and Raftery [10] to cluster p genes into K clusters based on their gene expression profiles over n samples. Note that the cluster-step is performed based on the observed continuous gene expression data, not the discretized gene expression patterns. For a given sample, using expression profiles of genes in a given cluster, we estimate a latent factor by fitting a RM. These latent factors are then used in a regression analysis to study the relationship between the gene expressions and the phenotype. The method allows to investigate whether a cluster of genes with similar expression profiles is related to the observed phenotypes, and can also be used for future prediction by estimating the latent factors using the maximum likelihood estimation. We give some details for each of these steps in the following sections.
Model-based clustering analysis
Cluster analysis, based on multivariate normal mixture models [10,17], has been used for clustering various types of biological, zoological, financial and industrial data. We first set up the mixture model for the gene expression data. Let x_{ j } = {x_{1j}, ..., x_{ nj }} be the n-dimensional vector of the jth gene expression over n samples. We assume that the gene expression values of p genes, x_{1}, ..., x_{ p }, arise from a mixture of K n-dimensional Gaussian distribution with density
where the τ_{ k } is the probability that a gene belongs to the kth cluster, and φ_{ n }(X|μ_{ k },Σ_{ k }) denotes the density function of the multivariate normal distribution with mean μ_{ k } and variance-covariance matrix Σ_{ k } Possible parameterization of the covariance matrix is discussed in Fraley and Raftery [10]. Note that if we assume a simple covariance structure, Σ_{ k } = λ I, where I is the identity matrix, and λ is the variance, then the model-based clustering method becomes the K-means clustering method [18].
Treating clustering as a mixture model problem allows us to use the EM algorithm to estimate the probability of a given gene belonging to each of the K clusters, and to estimate the corresponding mean vector and covariance matrix for each cluster [10]. One advantage of this approach is that it allows us to obtain an estimate of the number of gene clusters. Following Fraley and Raftery [10], we propose to use the Bayesian inference criterion (BIC) [19] for selecting the number of gene clusters. BIC is defined as
BIC(K) = 2L(K) - n_{ K } log p,
where L(K) is the maximized log-likelihood, n_{ k } is the number of independent parameters to be estimated in the K-cluster model and p is the sample size (number of the genes). We will choose K that gives the maximum BIC(K) value.
As the first step of our approach, we cluster genes into K clusters using the model-based clustering method described above, where the number of gene clusters K is determined by maximizing the BIC scores. Let C_{ k } denote the genes in cluster k, and p_{ k } denote the number of genes in this cluster, for k = 1,..., K.
Rasch model and regression analysis
We fit a RM model as in equation (1) for genes in each of the K clusters respectively, treating samples as 'persons' and genes as 'items'. To fit the RM (equation 1) for genes in the kth cluster, we let i be the sample index, and j be the gene index, for i = 1,..., n, and j = 1,..., p_{ k }, and let α_{ i } = α_{ ik } in model (1) be the latent factor for the ith sample which is determined by the genes in the kth cluster, and β_{ jl } be the gene-specific parameter for the jth gene. The RM assumes that the variation of the gene expression patterns observed over different samples is due to a latent factor, and it provides a probabilistic model to describe the gene expression pattern for a given sample.
for i = 1,..., n, where γ is the vector of regression parameters, ε_{ i }, is the error term, and the actual model of the regression function (f) and the distribution of the error depend on type of the phenotype. If the phenotype is a continuous variable, linear regression can be used, if it is a binary variable, logistic or probit regression can be used, and if the phenotype is survival time, the Cox regression model can be used. Alternatively, the generalized linear model can be used. One advantage of the proposed method is that we can model the interactions between the latent factors in the standard way of modeling interactions in the regression models. As only the estimated and not the observed latent factors are available in the regression model (equation 3), the variance of the estimate of the γ parameter has to be corrected. We propose to use a two-step bootstrap resampling procedure [20] to estimate the variance of the estimate of the parameter γ in the regression model. First, within the kth gene cluster, we resample genes and re-estimate the α_{ ik } parameter by fitting the RM (1). Second, for a given set of estimated α parameters, we resample the n samples from (y_{ i }, _{ i }), for i = 1,..., n, and fit the regression model (equation 3) to obtain a new estimate of γ. We can then estimate the variance of with these resampled estimates.
Prediction
For a new sample with gene expression x_{ new } = (x_{new,1},...,x_{new,p}), these p genes are first divided into K clusters according to the clustering result. For a gene j in cluster k, we first discretize its expression level into one of the m + 1 categories using the cut-off points used in the discretization-step, denoted by z_{new,j}. We can then estimate the corresponding latent factor α_{ k } by maximizing the following likelihood function,
RM for latent factors associated with genes
The second formulation of the RM for gene expression data is to treat genes as 'persons', and samples as 'items'. Assume that we have gene expression data of p different genes, indexed by i, over n_{ k } samples of the kth sample type, indexed by j, for k = 1, ..., K. Note that the indices i and j are used differently from the previous sections. We are interested in identifying genes that are expressed differently among these different sample types. Here each gene has its own expression patterns over different samples. For gene i, we can estimate a latent factor α_{ ik } based on its gene expression profile over n_{ k } samples from the kth sample type by fitting the RM (equation 1), for k = 1, ..., K. In this formulation of the RM, we treat genes as 'persons', treat samples as 'items', and treat each gene's expression level over samples as the responses. In RM (1), I = p, and J = n_{ k }, α_{ i } = α_{ ik }, which is the latent factor for the ith gene determined by the samples of the kth type, and β_{ jl } is the sample-specific parameter. This model assumes that the variation of gene expression patterns across different samples among different genes is due to several gene-specific latent factors. Here the latent factor α_{ ik } can be interpreted as some quantities related to the transcription factors of the ith gene which determine the gene expression levels in samples of the kth sample type. For a given tissue or cell line type k, genes with larger estimated latent factor (α_{ ik }) tend to have higher expression levels than those with smaller estimated latent factor. For a given sample type k, the estimated latent factors, ( _{1k}}, ..., _{ pk }), provide a nice way to order genes based on their expression levels over a small number of samples of sample type k, and to identify genes that are relatively over- or underexpressed in the kth sample type. In addition, by comparing the estimated latent factors associated with genes across different sample types, we can identify genes that are differentially expressed among different tissue or cell line types.
Analysis of the leukemia dataset
Classification using cluster-RM
We applied the proposed approach to the problem of classifying acute leukemias. Acute leukemias can broadly be divided into two classes, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), that originate, respectively, from cells of myeloid or lymphoid origin. The two diseases appear identical under the microscope. However, correct diagnosis is critical, as they respond best to different treatment regimens. Golub et al. [1] used a set of 38 leukemia samples including 11 AML and 27 ALL as training samples set, and used an additional 34 samples (14 AML, 20 ALL) as a test set for testing their proposed method for class prediction. In our analysis, we combined both the training and the test datasets into a dataset of 72 samples (25 AML and 47 ALL). For each of the 72 samples, the gene expression data were extracted from Affymetrix expression arrays.
Given the 100 genes selected, methods such as principal component analysis, partial least-square regression or composite covariate predictor can be used to further reduce the gene dimension to two or three dimensions by taking linear combinations of the gene expression levels. Instead of taking linear combination of gene expression levels, we first discretize the expression levels of all the 100 genes over 72 samples into four categories using the quantiles as the cut-off points. Therefore, for each gene, their expression level can take one of four possible values of 0, 1, 2 and 3. The same analysis was also done by discretizing gene expression levels into eight categories, the results were essentially the same. In the following, we only present the results using four categories. Fitting two RMs to these discretized gene expression levels, we estimate two latent factors for each sample; one latent factor is determined by gene expression profiles of 50 genes in one cluster, the other is determined by the gene expression profiles of 50 genes in another cluster. The right panel of Figure 2 shows the estimated values of these two latent factors for all the 72 samples. This plot shows that the two leukemia types are well separated by these two latent factors, with no overlap, except that two leukemia samples, one from ALL group and the other from AML group, are close to each other in this two-dimensional space.
Discriminant analysis using these two latent factors would expect to perform very well in classification. We performed a leave-one-out cross validation analysis to estimate the misclassification rate. Specifically, we leave one sample out, and first estimate the sample-specific latent factors α_{ ik } for the ith gene for k = 1,2 and gene-specific parameter β_{ jl } using the remaining samples. We then estimate the latent factors of the left-out sample by maximizing the likelihood function (Equation 4). Fisher's linear discriminant analysis using the estimated latent factors was then used to classify the left-out sample. The above procedure was applied to each of the 72 samples, and resulted in a misclassification rate of 2/72 = 3%. We use this example to demonstrate that two latent factors carry most of the information of the gene expression levels of the 100 genes.
Summary of gene expression profiles
Analysis of NC160 dataset
Relating gene expression profiles to drug activities
Genes over- and underexpressed in breast-cancer cell line
Overexpressed |
TP53 tumor protein p53 (Li-Fraumeni syndrome) |
Antiquitin |
Elongation factor TU, mitochondrial precursor |
Human mRNA for collagen-binding protein 2 |
Homo sapiens intermediate conductance calcium-activated potassium |
channel |
H. sapiens mRNA for phosphoenolpyruvate carboxykinase |
H. sapiens inactive palmitoyl-protein thioesterase-2i (PPT2) mRNA |
H. sapiens lysosomal neuraminidase precursor |
Underexpressed |
Tumor-associated antigen CO-029 |
Probable trans-1,2-dihydrobenzene-1,2-diol dehydrogenase |
Human fetus brain mRNA for membrane glycoprotein M6 |
Human mitochondrial 1,25-dihydroxyvitamin D3 24-hydroxylase mRNA |
H. sapiens DAP-kinase mRNA |
Carbonic anhydrase II antileukoproteinase 1 precursor |
Each of the 60 cell lines is now characterized by four different latent factors, where each latent factor is estimated based on the expression profiles of the genes in each of the four clusters. It would be interesting to relate these four latent factors to the drug activity patterns as measured by -log GI_{50} across the 60 cell lines. For a given drug, we first performed a simple linear regression analysis treating the drug activity as response variable and using one of the four latent factors as a predictor, and obtained the parameter estimate of γ in the following model:
drug activity = μ + γx latent factor.
In order to relate the drug activity of a given drug to all the four latent factors, we performed multiple linear regression analysis where drug activity for a given cell line was treated as a response variable, and the four latent factors were treated as the predictors. The right plot of Figure 6 shows the estimates of the parameters in the multiple regression model for each of the 90 drugs. This plot can be used for selecting drugs that are related to gene expression profiles. For example, only drugs with at least one large parameter estimate are important for further study, as only for these drugs, their activity levels are related to gene expression profiles.
Identifying genes differentially expressed in different cell lines
Discussion
We have described two different formulations of the RM for relating gene expression data to phenotypes. The RM provides a probabilistic model to describe the observed gene expression patterns. The first formulation can be used for cancer class prediction, and for identifying clusters of genes with similar expression profiles that are related to drug responses. The method is based on a combination of clustering analysis, the RM and the regression analysis. The second formulation can be used for identifying differentially expressed genes from different types of sample. We applied this method to a publicly available leukemia dataset to demonstrate the application of the proposed method for class prediction. We also applied this method to an analysis of the NCI60 data to show that the method can also be applied to other phenotypes.
The method introduced here has several advantages. First, it provides a probabilistic model for describing the gene expression patterns. These models are used to reduce the complexity of the raw data and offer a certain degree of simplification. In contrast to most of the currently available methods for gene expression data, such as principal component analysis, the model used here provides a non-linear method for dimension reduction. Second, compared with traditional density-based models, the method is more robust to outliers, as it uses ranks rather than actual expression levels. There is a long sequence of steps in the laboratory as well as in the image analysis before a single number is produced for an expression level, and there are many potential sources of error. Methods that use ranks rather than the original measured gene expression data are also advocated by A Tsodikov, A Szabo and D Jones (unpublished data) and Park et al. [21]. Third, in contrast to clustering methods for simultaneously clustering both genes and samples, the model-based approach allows formal estimates of the variance and therefore facilitates formal tests of null hypotheses and assignments of confidence intervals.
There are several limitations to the proposed approach. First, in order to apply the RM, the gene expression levels are dis-cretized. It is clear that by discretizing the measured expression levels we are losing information. Certainly, additional information is present in the level of gene expression, but normalization or scaling errors across subjects or slides make it difficult to determine the precision of these numbers. We believe that discretization provides a reasonably unbiased approach for dealing with this type of data. For both the leukemia and NCI60 datasets, we fitted the RMs by discretizing the gene expression levels into both four and eight categories, and obtained essentially the same conclusions. Of course, the more categories we use, the closer are the discretized data to the real continuous data. However, this will introduce more parameters to the model. Second, this approach assumes that genes can be clustered into several subgroups based on their expression profiles over samples. This might not be true for some studies. In this case, we can consider all the possible clusters of genes in a step-wise regression analysis as proposed by Hastie et al. [22]. Third, we used the bootstrap resampling procedure to estimate the variance of the regression parameter β after we have clustered genes into several classes. This procedure does not account for possible variability associated with the clustering step, and therefore, the bootstrap variance estimates are likely to be underestimated.
In our proposed method, the clustering, the Rasch modeling and the regression analysis are done separately. Important research for the future is to take a joint likelihood approach that can combine all three steps to obtain better estimates of the number of clusters, the latent factors and the regression parameters. This kind of mixture RM provides a natural framework for unifying statistical inference and clustering. We are currently carrying out research in this direction. In conclusion, we demonstrate here the potential application of the RMs in analysis of gene expression data. RMs provide a probability model for describing gene expression profiles measured over different samples or over different times. We are currently exploring various other formulations of the microarray gene expression problems in the framework of the RMs, including class discovery in cases of hidden taxonomies based on the estimated latent factors using the RM.
Declarations
Acknowledgements
This research is supported in part by an NIH grant (ES09911) and a UC Davis Health System Research Award grant.
Authors’ Affiliations
References
- Golub TR, Slonim DK, Tamayo P, Huard C, Caasenbeek M, Mesirov JP, Colle Hr, Loh ML, Downing JR, Caligiuri MA, et al: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1016/S0378-4371(00)00404-0.PubMedView ArticleGoogle Scholar
- Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA. 1999, 96: 6745-6750. 10.1073/pnas.96.12.6745.PubMedPubMed CentralView ArticleGoogle Scholar
- Alizadeh A, Eisen M, Davis RE, Ma C, Lossos I, Rosenwal A, Boldrick J, Sabet H, Tran T, Yu X, et al: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403: 503-511. 10.1038/35000501.PubMedView ArticleGoogle Scholar
- Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Kohn KW, Rein-hold WC, Myers TG, Andrews DT, Scudiero DA, et al: A cDNA microarray gene expression database for the molecular pharmacology of cancer. Nat Genet. 2000, 24: 236-244. 10.1038/73439.PubMedView ArticleGoogle Scholar
- West M, Nevins J, Marks JR, Spang A, Zuan H: Bayesian Regression Analysis in the "Large p, Small n" Paradigm with Application in DNA Microarray Studies. Technical Report. Duke University,. 2000, [http://www.isds.duke.edu]Google Scholar
- Rasch G: Probabilistic Models for Some Intelligence and Attachment Tests. Copenhagen: Danish Institute of Educational Research,. 1960Google Scholar
- Fischer GH, Molenaar IW, (eds): Rasch Models: Foundations, Recent Developments, and Applications. New York: Springer;. 1995Google Scholar
- Jaen J, Alvarez P, Roman P, Alonso E, Bayo E, Salas C: Rasch model: An available method for measuring the quality of life of cancer patients. Oncologia (Madrid). 1994, 17: 49-58.Google Scholar
- Sheehan TJ, DeChello LM, Garcia R, Fifield J, Rothfield N, Reisine S: Measuring disability: application of the Rasch model to activities of daily living (ADL/IADL). J Outcome Measurement. 2001, 4: 2000-681.Google Scholar
- Fraley C, Raftery AE: How many clusters? Which clustering method? Answers via model-based cluster analysis. Computer J. 1998, 41: 578-588.View ArticleGoogle Scholar
- Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Baker JL, Somogyi R: Large-scale temporal gene expression mapping of central nervous development. Proc Natl Acad Sci USA. 2000, 95: 334-339. 10.1073/pnas.95.1.334.View ArticleGoogle Scholar
- Eisen M, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.PubMedPubMed CentralView ArticleGoogle Scholar
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Sacccharomyces cerevisae by microarray hybridization. Mol Biol Cell. 1998, 9: 3273-3297.PubMedPubMed CentralView ArticleGoogle Scholar
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nat Genet. 1999, 22: 281-285. 10.1038/10343.PubMedView ArticleGoogle Scholar
- Masters GN: A Rasch model for partial credit scoring. Psychometrika. 1982, 47: 149-174.View ArticleGoogle Scholar
- Andersen EB: Polytomous Rasch models and their estimation. In Rasch Models: Foundations, Recent Developments, and Applications. Edited by Fischer GH, Molenaar IW. New York: Springer;. 1995Google Scholar
- McLachlan G, Basford K: Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker;. 1988Google Scholar
- Hartingan JA, Wong MA: Algorithm AS136: A k-means clustering algorithm. Appl Statistics. 1979, 28: 10-108.Google Scholar
- Schwarz G: Estimating the dimension of a model. Annls Statistics. 1978, 6: 461-464.View ArticleGoogle Scholar
- Efron B: The Jacknife, the Bootstrap, and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics,. 1982View ArticleGoogle Scholar
- Park PJ, Pagano M, Bonetti M: A nonparametric scoring algorithm for identifying informative genes from microarray data. Proc Pacific Symp Biocomput. 2001, 6: 52-63.Google Scholar
- Hastie T, Tibshirani R, Botstein D, Brown P: Supervised harvesting of expression trees. Genome Biol. 2001, 2: research0003.1-003.12. 10.1186/gb-2001-2-1-research0003.View ArticleGoogle Scholar