- Open Access
Classification methods for the development of genomic signatures from high-dimensional data
© Moon et al.; licensee BioMed Central Ltd. 2006
Received: 28 July 2006
Accepted: 20 December 2006
Published: 20 December 2006
Personalized medicine uses genomic signatures of patients to assign effective therapies. We present Classification by Ensembles from Random Partitions (CERP) for class prediction and apply CERP to genomic data on leukemia patients and to genomic data with several clinical variables on breast cancer patients. CERP performs consistently well compared to the other classification algorithms. The predictive accuracy can be improved by adding relevant clinical/histopathological measurements to the genomic data.
Providing guidance on specific therapies for pathologically distinct tumor types to maximize efficacy and minimize toxicity is important for cancer treatment [1, 2]. For acute leukemia, for instance, different subtypes show very different responses to therapy, reflecting the fact that they are molecularly distinct entities, although they have very similar morphological and histopathological appearance. Thus, accurate classification of tumor samples is essential for efficient cancer treatment in the target population of patients. Microarray technology has been increasingly used in cancer research because of its potential for classification of tissue samples based only on gene expression data, without prior and often subjective biological knowledge [1, 3, 4]. Much research involving microarray data analysis focuses on distinguishing between different cancer types using gene expression profiles from disease samples, thereby allowing more accurate diagnosis and more effective treatment of each patient.
Gene expression data might also be used to improve disease prognosis and thereby spare some patients painful, unsuccessful therapies and unnecessary toxicity. For example, adjuvant chemotherapy for breast cancer after surgery can reduce the risk of distant metastases; however, seventy to eighty percent of patients receiving this treatment would be expected to survive metastasis-free without it [5, 6]. The strongest predictors for metastases, such as lymph node status and histological grade, fail to accurately classify breast tumors according to their clinical behavior [6, 7].
Predicting patient response to therapy or the toxic potential of drugs from high-dimensional data is a common goal of biomedical studies. Classification algorithms can process high-dimensional genomic data for better prognostication of disease progression and better prediction of response to therapy, helping to individualize clinical assignment of treatment. The predictive models built must be highly accurate, since misclassification may result in suboptimal treatment or an incorrect risk profile. Biomedical applications commonly involve numerous genomic and clinical predictor variables measured on a relatively small number of patients, which makes it difficult for most traditional classification algorithms to avoid over-fitting the data.
Class prediction is a supervised learning method in which the algorithm learns from a training set (known samples) and establishes a prediction rule to classify new samples. Development of a class prediction algorithm generally consists of three steps: first, selection of predictors; second, fitting the prediction model to develop the classification rule; and third, performance assessment. The first two steps build a prediction model, and the third step assesses the performance of the model. Some classification algorithms, such as the classification tree or stepwise logistic regression, perform the first two steps simultaneously. Sensitivity (SN) and specificity (SP), as well as positive predictive value (PPV) and negative predictive value (NPV), are the primary criteria used to evaluate the performance of a classification algorithm. The SN is the proportion of truly positive samples that are correctly classified as positive, and the SP is the proportion of truly negative samples that are correctly classified as negative. The accuracy is the proportion of all samples that are correctly classified. The PPV is the probability that a patient is truly positive given a positive prediction, while the NPV is the probability that a patient is truly negative given a negative prediction. Algorithms with high SN and SP as well as high PPV and NPV, and hence high accuracy, are obviously desirable.
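These criteria can be computed directly from the four cells of a 2 × 2 confusion matrix (true positives, false negatives, true negatives, false positives). A minimal sketch (the function name and the illustrative counts are ours, not from the paper):

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute SN, SP, PPV, NPV and accuracy from the four
    cells of a 2x2 confusion matrix."""
    sn = tp / (tp + fn)            # sensitivity: correct among truly positive
    sp = tn / (tn + fp)            # specificity: correct among truly negative
    ppv = tp / (tp + fp)           # P(truly positive | predicted positive)
    npv = tn / (tn + fn)           # P(truly negative | predicted negative)
    acc = (tp + tn) / (tp + fn + tn + fp)
    return sn, sp, ppv, npv, acc
```

For example, a hypothetical split of 72 samples with 45 of 47 positives and 24 of 25 negatives classified correctly gives an accuracy of 69/72, about 95.8%.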
Recently, a new ensemble-based classification algorithm, Classification by Ensembles from Random Partitions (CERP), has been developed. This algorithm is designed specifically for high-dimensional data sets. The rationale behind CERP is twofold: first, multiple classifiers can capture most aspects of the underlying biological phenomena encoded in the data; and second, combining the results of multiple diversified models can produce a superior classifier for biomedical decision making. In this paper, we use Classification-Tree CERP (C-T CERP), an ensemble of ensembles of optimal classification trees based on the Classification and Regression Trees (CART) algorithm, constructed from randomly partitioned, mutually exclusive subsets of the entire predictor set. The number of features in each subset is as close to equal as possible.
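The random partitioning at the heart of CERP can be sketched as follows; the subset count is illustrative and this is not the authors' implementation:

```python
import random

def random_partition(features, n_subsets, seed=0):
    """Randomly split a feature index list into mutually exclusive
    subsets whose sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(features)
    rng.shuffle(shuffled)
    # deal the shuffled features round-robin so sizes stay nearly equal
    return [shuffled[i::n_subsets] for i in range(n_subsets)]
```

With, say, 3,571 genes and 100 subsets, each tree in an ensemble would see 35 or 36 features, and every gene appears in exactly one subset.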
The performance of CERP is compared to other well-known classification algorithms: Random Forest (RF), Boosting [11, 12], Support Vector Machine (SVM), Diagonal Linear Discriminant Analysis (DLDA), Shrunken Centroids (SC), CART, Classification Rule with Unbiased Interaction Selection and Estimation (CRUISE), and Quick, Unbiased and Efficient Statistical Tree (QUEST). CERP uses a partitioning scheme to establish mutually exclusive subsets of the predictors. In contrast, RF takes a bootstrap sample of patients for each tree and randomly selects predictors with replacement from the entire set of predictors at each node. Boosting gives extra weight to previously misclassified samples. Like CERP, RF and Boosting are ensemble classifiers. SVM is a kernel-based machine learning approach. DLDA is a classification rule based on a linear discriminant function. SC is an enhancement of the simple nearest-centroid classifier. CART, CRUISE and QUEST are single optimal trees. Among these single-tree algorithms, CART and QUEST yield binary trees, while CRUISE yields multiway splits.
In this study, the classification algorithms are applied to three popular public data sets relevant to personalized medicine. The algorithms are first used to predict leukemia subtype, acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML), based on gene-expression data. They are then used on two different data sets [6, 17] to predict which breast cancer patients would benefit from adjuvant chemotherapy based on gene-expression data. We also investigate whether adding seven clinical/histopathological variables (age, tumor size, tumor grade, angioinvasion, estrogen receptor status, progesterone receptor status and lymphocytic infiltrate) to the high-dimensional genomic data on breast cancer patients enhances classification accuracy. The performance of the classification algorithms is assessed by 20 replications of 10-fold cross-validation (CV).
Determination of cancer type and stage is often crucial to the assignment of appropriate treatment. Because chemotherapy regimens for patients with ALL differ from regimens for patients with AML, distinguishing between the leukemia subtypes is critical for personalized treatment. Golub et al. described a generic approach to cancer classification of the two subtypes of acute leukemia based on gene expression monitoring by DNA microarray technology. The data set consists of 47 patients with ALL and 25 patients with AML. The gene expression levels were measured by Affymetrix high-density oligonucleotide arrays containing 6,817 human genes. Before normalization, the data were preprocessed by the following steps: thresholding, with a floor of 100 and a ceiling of 16,000; filtering, with exclusion of genes with max/min ≤5 or (max - min) ≤500, where max and min refer to the maximum and minimum expression levels of a particular gene across the 72 mRNA samples, respectively; and base-10 logarithmic transformation. The resulting data comprise 72 mRNA samples and 3,571 genes.
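The three preprocessing steps can be sketched as follows; the function name and the samples-by-genes matrix orientation are our assumptions:

```python
import numpy as np

def preprocess(expr, floor=100.0, ceiling=16000.0, fold=5.0, diff=500.0):
    """Threshold, filter and log-transform a samples-by-genes
    expression matrix, following the steps described above."""
    x = np.clip(expr, floor, ceiling)                    # thresholding
    gmax, gmin = x.max(axis=0), x.min(axis=0)
    # exclude genes with max/min <= 5 or (max - min) <= 500
    keep = (gmax / gmin > fold) & (gmax - gmin > diff)   # filtering
    return np.log10(x[:, keep]), keep                    # log10 transform
```

Applied to the 72 × 6,817 matrix with these defaults, such a filter would retain the 3,571 genes described above.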
Performance of classification algorithms for the leukemia data based on 20 repetitions of 10-fold CV
Breast cancer classification
The objective of the two studies [6, 17] was to use gene expression data to identify patients who might benefit from adjuvant chemotherapy, according to prognostication of distant metastases for breast cancer. The van 't Veer et al. data contain 78 primary breast cancers: 34 from patients who developed distant metastases within 5 years (poor prognosis) and 44 from patients who remained disease-free for at least 5 years (good prognosis). These samples were selected from patients who were lymph node negative and under 55 years of age at diagnosis. Out of approximately 25,000 gene expression levels, about 5,000 genes that were significantly regulated (at least a two-fold difference and a p value of less than 0.01) in more than 3 of the 78 tumors were selected. In addition, seven relevant clinical/histopathological predictors were added to these gene expression data to investigate whether their inclusion improves the prediction accuracy compared to genomic data alone.
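A hedged sketch of this kind of filter, assuming per-measurement log-ratios and p-values are already available as tumors-by-genes arrays (the original study derived p-values from an error model of the array measurements, which is not reproduced here):

```python
import numpy as np

def select_regulated_genes(logratio, pvals, fold=2.0, alpha=0.01, min_tumors=3):
    """Keep genes showing at least a `fold` change (|log2 ratio| >= 1)
    with p < alpha in more than `min_tumors` tumors."""
    regulated = (np.abs(logratio) >= np.log2(fold)) & (pvals < alpha)
    return regulated.sum(axis=0) > min_tumors
```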
The study of van de Vijver et al. involved a cohort of young women with stage I or II breast cancer who were treated at the hospital of the Netherlands Cancer Institute. All were younger than 53 years old; 151 were lymph-node-negative and 144 were lymph-node-positive. Among the 295 patients, 180 had a poor-prognosis signature and 115 had a good-prognosis signature. From approximately 25,000 human genes, we selected about 5,000 genes according to the correlation of the microarray data with the prognosis profile. There were no missing data.
Performance of classification algorithms for the van 't Veer et al. breast cancer genomic data based on 20 repetitions of 10-fold CV
Performance of classification algorithms for the van 't Veer et al. breast cancer genomic and clinical/histopathological data based on 20 trials of 10-fold CV
Recent advancements in biotechnology have accelerated research on the development of molecular biomarkers for the diagnosis and treatment of disease. The Food and Drug Administration envisions clinical pharmacogenomic profiling to identify patients most likely to benefit from particular drugs and patients most likely to experience adverse reactions. Such patient profiling will enable assignment of drug therapies on a scientifically sound predictive basis rather than on an empirical trial-and-error basis. The goal is to change medical practice from a population-based approach to an individualized approach.
We have presented statistical classification algorithms to accurately classify patients into risk/benefit categories using high-dimensional genomic and other data. Classification algorithms were illustrated by three published data sets and the new C-T CERP was compared to the best known published classification procedures. CERP is a consistently good algorithm and maintains a good balance between sensitivity and specificity even when sample sizes between classes are unbalanced.
In one application, leukemia patients were classified as having either ALL or AML based on each individual patient's gene-expression profile. The distinction is important because the chemotherapies required for the two subtypes are very different, and incorrect treatment assignment has both efficacy and toxicity consequences. Classification algorithms are essential for the realization of personalized medicine in this application, because distinguishing ALL and AML otherwise requires an experienced hematologist's interpretation of several analyses performed in a highly specialized laboratory. CERP correctly classified patients with the lowest cross-validated error rate of 1.4% (0 or 1 misclassification) compared to the other classification procedures we considered (more than 1 misclassification). This level of accuracy shows the real potential for confident clinical assignment of therapies on an individual patient basis.
In the other application, post-surgery breast cancer patients were classified by the algorithms as having either a good or a poor prognosis, in terms of the likelihood of distant metastasis within five years, based on gene-expression profiles. If this were brought into clinical application, a patient with a confidently predicted good prognosis might elect out of adjuvant chemotherapy and its associated debilitating side effects. With current rule-based decisions, almost all patients are subjected to chemotherapy. When just a few clinical and histopathological measures traditionally used for treatment assignment were added to the numerous genomic predictors, the prediction accuracy appeared to be enhanced further. Importantly, according to the theory underlying the CERP algorithm, the more individual patient information that is used, whatever the source or type, the greater the likelihood that the prediction accuracy will increase. While the van 't Veer et al. data do not contain enough information to allow confident prognoses, the van de Vijver et al. data show improved cross-validated overall accuracy that might be sufficiently high for clinical practice. It is worth noting that CERP and all the other methods do not perform as well as the method reported in the van 't Veer et al. study (62.3% versus 83% accuracy). It may be that the feature selection method used by van 't Veer et al. overfit the data and that they did not have a true cross-validation test: they appear to have used correlation with outcome for feature selection outside the cross-validation procedure. It is anticipated that the combined use of multiple biomarkers on individual patients could improve the prediction accuracy of data like the present genomic data to a level suitable for clinical practice.
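The selection-bias point can be made concrete: an honest cross-validated error estimate requires that any outcome-based feature selection be redone inside each training fold, never once on the full data. A minimal sketch using a correlation filter and a nearest-centroid classifier, both our illustrative choices rather than the methods compared in the paper:

```python
import numpy as np

def cv_error_with_inner_selection(X, y, n_folds=10, n_keep=50, seed=0):
    """10-fold CV in which outcome-correlated features are reselected
    within each training fold only, avoiding selection bias."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = 0
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # feature selection on the training fold only
        yc = y[train] - y[train].mean()
        Xc = X[train] - X[train].mean(axis=0)
        denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
        corr = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
        keep = np.argsort(corr)[-n_keep:]
        # nearest-centroid classifier on the selected features
        c0 = X[train][y[train] == 0][:, keep].mean(axis=0)
        c1 = X[train][y[train] == 1][:, keep].mean(axis=0)
        d0 = np.linalg.norm(X[test][:, keep] - c0, axis=1)
        d1 = np.linalg.norm(X[test][:, keep] - c1, axis=1)
        errors += int(((d1 < d0).astype(int) != y[test]).sum())
    return errors / len(y)
```

Selecting features once on all samples and then cross-validating only the classifier would make the same procedure look optimistically accurate.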
Materials and methods
Ensemble methods to enhance prediction accuracy
Let X_i be a random variable indicating a classification by the i-th independent classifier, where X_i = 1 if the classification is correct and X_i = 0 if not. We let p be the prediction accuracy of each classifier. Then the X_i are Bernoulli(p), and the number of accurate classifications by the ensemble majority voting method is

Y = X_1 + X_2 + ... + X_r,

which is Binomial(r, p). We let r = 2k + 1, where k is a nonnegative integer. We define the prediction accuracy of the ensemble by majority voting as

A_r = P(Y ≥ k + 1).

Then the prediction accuracy of the ensemble can be obtained using the standard binomial probability:

A_r = Σ_{i=k+1}^{r} C(r, i) p^i (1 - p)^(r-i),

where C(r, i) is the binomial coefficient.
It has been shown that the majority vote is guaranteed to give a higher accuracy than an individual classifier when the individual classifiers have an accuracy greater than 0.5. In practice, the classifiers may be correlated to a certain degree; positively correlated classifiers tend to produce the same prediction outcomes. Kuncheva et al. relaxed the restriction that the classifiers be independent. When the classifiers in the ensemble are positively correlated, we use the beta-binomial model [19–21] to obtain the prediction accuracy. The beta-binomial model is commonly used to model positively correlated binary variables.
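For independent classifiers, A_r can be evaluated directly from the binomial sum. A minimal sketch (the function name is ours; the correlated, beta-binomial case is not shown):

```python
from math import comb

def ensemble_accuracy(p, r):
    """Majority-vote accuracy A_r of r independent classifiers,
    each correct with probability p (r must be odd, r = 2k + 1)."""
    assert r % 2 == 1
    k = (r - 1) // 2
    # P(Y >= k + 1) for Y ~ Binomial(r, p)
    return sum(comb(r, i) * p**i * (1 - p)**(r - i) for i in range(k + 1, r + 1))
```

For any p > 0.5, A_r exceeds p and grows toward 1 as r increases, while for p < 0.5 majority voting makes things worse, which is the content of the guarantee cited above.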
Enhancement of the prediction accuracy by ensemble majority voting*
Prediction accuracy of each base classifier
In C-T CERP, we employ majority voting among trees within individual ensembles and then among ensembles. Within an ensemble, using the training data, only the trees with the highest sensitivity and specificity (>90%) are kept, which reduces each ensemble to a small number of tree classifiers. When fewer than three trees are selected in an ensemble, the cut-off value is decreased in five percent increments until at least three trees are selected. New ensembles are created by randomly re-partitioning the feature space and similarly reducing to a different set of classifiers. Most of the improvement from adding ensembles was achieved by the first few ensembles, with diminishing gains as more ensembles were added. In this paper, we fixed the default number of ensembles at 15 according to our preliminary results. The final ensemble prediction is then based on the majority vote across these ensembles. C-T CERP is implemented in C. A potential user can obtain the software by contacting the authors or by downloading it from the web site.
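The voting and tree-selection scheme can be illustrated with a deliberately simplified sketch. One-split decision stumps stand in for the optimal CART trees, and the parameter values are illustrative; this is not the authors' C implementation:

```python
import numpy as np

def stump_predict(stump, X):
    """Predict 0/1 labels from a single-feature threshold rule."""
    j, t, sign = stump
    above = (X[:, j] > t).astype(int)
    return above if sign == 1 else 1 - above

def fit_stump(X, y):
    """Best single-feature threshold split by training accuracy --
    a stand-in for the optimal trees grown by C-T CERP."""
    best = (0, 0.0, 1, -1.0)                 # (feature, threshold, sign, accuracy)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                acc = float((stump_predict((j, t, sign), X) == y).mean())
                if acc > best[3]:
                    best = (j, t, sign, acc)
    return best[:3]

def cerp_sketch(Xtr, ytr, Xte, n_ensembles=5, n_subsets=4, cutoff=0.9, seed=0):
    """Simplified C-T CERP: per ensemble, randomly partition the features,
    fit one classifier per subset, keep those with training SN and SP
    above the cutoff (relaxed in 5% steps until at least three survive),
    then majority-vote within each ensemble and across ensembles."""
    rng = np.random.default_rng(seed)
    ensemble_votes = []
    for _ in range(n_ensembles):
        subsets = np.array_split(rng.permutation(Xtr.shape[1]), n_subsets)
        stumps = [(fit_stump(Xtr[:, s], ytr), s) for s in subsets]
        c = cutoff
        while True:
            kept = []
            for stump, s in stumps:
                pred = stump_predict(stump, Xtr[:, s])
                sn = pred[ytr == 1].mean()
                sp = 1.0 - pred[ytr == 0].mean()
                if sn > c and sp > c:
                    kept.append((stump, s))
            if len(kept) >= 3:
                break
            c -= 0.05                        # relax the cut-off in 5% steps
        votes = np.mean([stump_predict(st, Xte[:, s]) for st, s in kept], axis=0)
        ensemble_votes.append((votes > 0.5).astype(int))
    return (np.mean(ensemble_votes, axis=0) > 0.5).astype(int)
```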
A package (RandomForest) in R is used for the RF algorithm. The number of trees is set to the default of ntree = 500, and the number of features selected at each node in a tree is set to the default value of floor(√m), where m is the total number of features. Similarly, a package (e1071) in R is applied for SVM, in which the radial basis kernel is used as the default. Among the many boosting methods, AdaBoost is adopted using a package (boost) in R with the default options. For DLDA, a package (sma) in R is employed with the default options. SC is implemented with a package (pamr) in R with soft thresholding as the default. For single optimal trees, CART is implemented with a package (rpart) in R with the default options. For CRUISE and QUEST, compiled binaries are downloaded from the web site and run from R.
In many cases, the number of features (m) is much greater than the number of patients (n). In such cases, cross-validation is used to obtain a valid measure of prediction accuracy for genomic signature classifiers. CV resamples the entire data set without replacement to repeatedly develop classifiers on a training set and evaluate them on a separate test set, and then averages the results over the resamplings.
We evaluated the prediction accuracy, the balance between sensitivity (SN) and specificity (SP), and the balance between positive predictive value (PPV) and negative predictive value (NPV) of the classification algorithms by averaging the results from 20 replications of 10-fold CV in order to achieve a stable result. Twenty replications should be sufficient according to Molinaro et al., who recommended ten trials of ten-fold CV to achieve low MSE and bias.
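The replication scheme can be sketched as follows; the round-robin fold assignment is our illustrative choice, not necessarily the one used in the study:

```python
import random

def repeated_kfold_indices(n, k=10, reps=20, seed=0):
    """Yield (replication, fold, test_indices) for `reps` repetitions of
    k-fold CV; each replication reshuffles the n samples and splits them
    into k nearly equal, non-overlapping test folds."""
    rng = random.Random(seed)
    for r in range(reps):
        idx = list(range(n))
        rng.shuffle(idx)
        for f in range(k):
            yield r, f, idx[f::k]
```

Each replication uses every sample exactly once as a test case, and averaging the 20 replications stabilizes the accuracy, SN/SP and PPV/NPV estimates.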
Hongshik Ahn's research was partially supported by the Faculty Research Participation Program at the NCTR administered by the Oak Ridge Institute for Science and Education through an interagency agreement between USDOE and USFDA.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al: Molecular classification of cancer: discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.
- Zhang H, Yu C-Y, Singer B, Xiong M: Recursive partitioning for tumor classification with gene expression microarray data. Proc Natl Acad Sci USA. 2001, 98: 6730-6735. 10.1073/pnas.111153698.
- Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002, 97: 77-87. 10.1198/016214502753479248.
- Alexandridis R, Lin S, Irwin M: Class discovery and classification of tumor samples using mixture modeling of gene expression data - a unified approach. Bioinformatics. 2004, 20: 2545-2552. 10.1093/bioinformatics/bth281.
- Early Breast Cancer Trialists' Collaborative Group: Polychemotherapy for early breast cancer: an overview of the randomised trials. Lancet. 1998, 352: 930-942. 10.1016/S0140-6736(05)61359-1.
- van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, et al: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415: 530-536. 10.1038/415530a.
- McGuire WL: Breast cancer prognostic factors: evaluation guidelines. J Natl Cancer Inst. 1991, 83: 154-155. 10.1093/jnci/83.3.154.
- Ahn H, Moon H, Fazzari MJ, Lim N, Chen JJ, Kodell RL: Classification by ensembles from random partitions. Technical Report. 2006, SUNYSB-AMS-06-03, Stony Brook University, Department of Applied Mathematics and Statistics.
- Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. 1984, California: Wadsworth.
- Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.
- Freund Y, Schapire R: A decision-theoretic generalization of online learning and an application to boosting. J Comput Syst Sci. 1997, 55: 119-139. 10.1006/jcss.1997.1504.
- Schapire R: The strength of weak learnability. Mach Learn. 1990, 5: 197-227.
- Vapnik V: The Nature of Statistical Learning Theory. 1995, New York: Springer.
- Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA. 2002, 99: 6567-6572. 10.1073/pnas.082099299.
- Kim H, Loh W-Y: Classification trees with unbiased multiway splits. J Am Stat Assoc. 2001, 96: 589-604. 10.1198/016214501753168271.
- Loh W-Y, Shih Y-S: Split selection methods for classification trees. Stat Sinica. 1997, 7: 815-840.
- van de Vijver MJ, He YD, van 't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, et al: A gene-expression signature as a predictor of survival in breast cancer. New Engl J Med. 2002, 347: 1999-2009. 10.1056/NEJMoa021967.
- Kuncheva LI, Whitaker CJ, Shipp CA, Duin RPW: Limits on the majority vote accuracy in classifier fusion. Pattern Anal Appl. 2003, 6: 22-31. 10.1007/s10044-002-0173-7.
- Williams DA: The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics. 1975, 31: 949-952. 10.2307/2529820.
- Ahn H, Chen JJ: Generation of over-dispersed and under-dispersed binomial variates. J Comput Graph Stat. 1995, 4: 55-64. 10.2307/1390627.
- Ahn H, Chen JJ: Tree-structured logistic regression model for over-dispersed binomial data with application to modeling developmental effects. Biometrics. 1997, 53: 435-455. 10.2307/2533948.
- CERP. [http://www.ams.sunysb.edu/~hahn/research/CERP.html]
- QUEST. [http://www.stat.wisc.edu/~loh/]
- Molinaro AM, Simon R, Pfeiffer RM: Prediction error estimation: a comparison of resampling methods. Bioinformatics. 2005, 21: 3301-3307. 10.1093/bioinformatics/bti499.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.