Genome-driven integrated classification of breast cancer validated in over 7,500 samples
© Ali et al.; licensee BioMed Central Ltd. 2014
Received: 22 July 2014
Accepted: 1 August 2014
Published: 28 August 2014
IntClust is a classification of breast cancer comprising 10 subtypes based on molecular drivers identified through the integration of genomic and transcriptomic data from 1,000 breast tumors and validated in a further 1,000. We present a reliable method for subtyping breast tumors into the IntClust subtypes based on gene expression and demonstrate the clinical and biological validity of the IntClust classification.
We developed a gene expression-based approach for classifying breast tumors into the ten IntClust subtypes by using the ensemble profile of the index discovery dataset. We evaluate this approach in 983 independent samples for which the combined copy-number and gene expression IntClust classification was available. Only 24 samples are discordantly classified. Next, we compile a consolidated external dataset composed of a further 7,544 breast tumors. We use our approach to classify all samples into the IntClust subtypes. All ten subtypes are observable in most studies at comparable frequencies. The IntClust subtypes are significantly associated with relapse-free survival and recapitulate patterns of survival observed previously. In studies of neo-adjuvant chemotherapy, IntClust reveals distinct patterns of chemosensitivity. Finally, patterns of expression of genomic drivers reported by TCGA (The Cancer Genome Atlas) are better explained by IntClust as compared to the PAM50 classifier.
IntClust subtypes are reproducible in a large meta-analysis, show clinical validity and best capture variation in genomic drivers. IntClust is a driver-based breast cancer classification and is likely to become increasingly relevant as more targeted biological therapies become available.
The classification of breast tumors based on morphology (histological type and grade) and two key markers, estrogen receptor (ER) and human epidermal growth factor receptor 2 (HER2), remains the mainstay of current clinical practice. Early attempts to improve this situation by using genomic technology focused on data-driven methods including unsupervised transcriptome-based classification [1-3] and gene signatures trained against a specific clinical outcome [4-6]. However, these approaches are not based on the underlying molecular changes which ultimately constitute a tumor’s oncogenic drive. More recent genomic studies have begun to reveal the complexity of the landscape of somatic alterations in breast cancer at the levels of mutations and copy number alterations (CNAs) [7-12]. The strategy for discriminating between driver and passenger events amongst these somatic alterations has, for non-synonymous mutations, focused on identification of genes more frequently mutated than expected by chance in a given collection of tumor samples. Although this approach has required some adjustment owing to the non-random background mutation rates in cancer genomes and may be complemented by accounting for the pattern of mutational distribution within genes, it does provide a roadmap for the comprehensive identification of all driver mutations if a sufficiently large sample size is interrogated. In the case of CNAs, an additional strategy has been to integrate genomic and transcriptomic data in order to identify areas of recurrent alteration associated with deregulated gene expression (expression quantitative trait loci (eQTLs)) [16-18]. Importantly, the balance between somatic mutations and alterations in copy number has been investigated as part of The Cancer Genome Atlas (TCGA) pan-cancer analysis of 12 tumor types.
Investigation of a shortlist of ‘selected functional events’ revealed an approximately inverse relationship between mutation and CNAs with some tumor types dominated by mutations deemed ‘M-class’ (for example, renal cell carcinoma and colorectal adenocarcinoma), while others were dominated by CNAs deemed ‘C-class’. Prototypical ‘C-class’ tumor types were ovarian and breast cancer. This analysis highlights the need for a classification scheme based on the pattern of somatic driver alterations in a particular tumor, which, in the case of breast tumors, is dominated by CNAs. Using the largest sample collection with extensive genomic, transcriptomic and clinical annotation in existence, we previously described a scheme for classifying breast tumors into 10 subtypes based on the pattern of CNAs which exert a concordant effect on gene expression in cis (eQTLs). This classification was named IntClust owing to the clustering of tumors based on the integration of genomic and transcriptomic data to find probable driver events. The scheme remains the only genome-wide driver-based classification of breast cancer that reconciles tumor genomes with their transcriptomes and, as such, has significant potential for rational patient stratification. Further validation of the clinical and biological significance of this approach requires a reliable method to subtype tumors in independent cohorts assayed on different platforms. This is, in part, due to the relative scarcity of studies for which both high-resolution copy-number and transcriptomic data are available, since the classification requires both data types. Here, we have overcome this hurdle by developing a flexible method for tumor subtyping which only requires gene expression data and is not limited to specific platforms. This gene expression-based classifier has enabled us to investigate the IntClust classification in the numerous translational studies for which transcriptomic and clinical data are publicly available.
Here, we report on the reproducibility of IntClust subtypes, their clinical validity and the extent to which they capture the landscape of somatic driver alterations in breast cancer using these external independent studies.
In order to quantify the efficacy of our method by study, we used a correlation statistic to estimate the goodness of fit of the classification model where a score of 1.0 indicates perfect correlation between the gene expression profiles of new samples and those contained within the index dataset. Figure 1C depicts the correlation (goodness of fit) statistics, number of samples and number of features (of a possible 714) for every study. This comparison of average gene expression profiles by subtype indicates a striking conservation of patterns across studies with the average correlation being 0.69. The highest correlation of 0.95 was, as expected, associated with the METABRIC validation dataset. The next highest correlation of 0.92 related to RNA-seq samples from TCGA. The lowest correlation was a significant outlier among studies at 0.1. Although it was not possible to definitively determine the basis for this poor correlation, we note that the distribution of ESR1 and ERBB2 expression was not bimodal for this study and, in general, there appeared to be a low signal-to-noise ratio. The Pearson correlation coefficient between goodness of fit and number of samples per study was 0.53 and between goodness of fit and number of features per study was 0.38. As a comparator, we also classified samples into the ‘intrinsic subtypes’ using the PAM50 classifier and into four molecular subtypes based on three genes (ESR1, ERBB2 and AURKA) using the SCMGENE classifier. We evaluated the effect of platform variability on subtype assignment by using 475 samples from the TCGA study for which gene expression data had been collected using both RNA-seq and microarrays. Cross-tabulations of subtype assignment with Kappa-agreement statistics, by data type (RNA-seq or microarray) for each of the three classifiers (SCMGENE, PAM50 and IntClust) are presented in Additional file 5. The agreement between data types was 93.1% for SCMGENE, 93.7% for PAM50 and 81.3% for IntClust.
It should be noted that the number of possible classes significantly influences the rate of concordance for a classification. The expected agreement by chance alone for SCMGENE (four groups) was 29.8%, for PAM50 (five groups) was 33.2% while for IntClust (ten groups) was 12.0%. Similarly, when interpreting the importance of discordantly classified cases, the number of possible classes should be taken into account since the relative difference between classes is likely to be smaller for a classification comprising a larger number of possible groups.
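The relationship between observed agreement, chance agreement and the Kappa statistic can be made concrete with a short Python sketch. It is not part of the original analysis (which was conducted in R and Stata); the function names and the toy cross-tabulation are hypothetical. Kappa values derived here from the rounded percentages quoted above are illustrative only and may differ slightly from those in Additional file 5.

```python
def kappa_from_agreement(observed, expected):
    """Cohen's kappa from observed and chance-expected agreement (proportions)."""
    return (observed - expected) / (1.0 - expected)


def cohen_kappa(table):
    """Return (observed, expected, kappa) for a square cross-tabulation.

    Rows are subtype calls from one data type, columns from the other.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_tot = [sum(table[i]) for i in range(k)]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: product of the marginal proportions, summed over classes.
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    return observed, expected, kappa_from_agreement(observed, expected)


# Agreement figures quoted in the text (observed, expected by chance).
reported = {
    "SCMGENE": (0.931, 0.298),
    "PAM50": (0.937, 0.332),
    "IntClust": (0.813, 0.120),
}
for name, (obs, exp) in reported.items():
    print(name, round(kappa_from_agreement(obs, exp), 3))

# Hypothetical three-class cross-tabulation, for illustration only.
toy = [[40, 5, 5],
       [4, 30, 6],
       [6, 5, 49]]
print(cohen_kappa(toy))
```

Because chance agreement falls as the number of classes grows, the chance-corrected statistic narrows the apparent gap between a ten-class scheme and a four- or five-class scheme relative to raw percentage agreement.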
We also applied our classifier to a large panel of cell lines from two data repositories (Sanger COSMIC database and the Cancer Cell Lines Encyclopedia (CCLE)). We applied three versions of our classifier to these data: copy number data alone, gene expression alone and the combined copy number/gene expression feature set. The goodness of fit statistics for these classifiers are depicted in a scatter plot in Additional file 6. Overall, the copy number-based classifier performed better than the expression-based or combined classifier. The ensemble goodness of fit for the copy number-based classifier was 0.74 using the Sanger dataset and 0.75 using the CCLE dataset, compared with the ensemble average goodness of fit for the expression-based classifier, which was 0.47 using the Sanger dataset and 0.62 using the CCLE dataset. These differences may be due to variation in culture conditions and passages, which are more likely to be reflected in gene expression than in CNAs. Weighted scatterplots depicting cell line classification according to classifier type and by dataset are presented in Additional file 6. Similarly, comparisons of PAM50 and SCMGENE classification between datasets are depicted in Additional file 7. There was considerable variability in subtype assignment for cell lines according to the origin of the data for all classifiers. This highlights the challenge of reliable cell line classification, which is likely due to drift over time and variability in cell culture conditions. Our findings show that, on average, copy-number profiles of cell lines are more similar to primary tumors than gene expression profiles and ought to be preferentially used for their classification into molecular subtypes. Details of molecular subtype assignment for each cell line by data source are presented in Additional file 8.
A subset of patients in some of the studies received neo-adjuvant (before definitive surgery) chemotherapy, and tissue would have been derived from biopsies or fine needle aspirates. Here, we note that based even on these samples, IntClust subtype could be reliably assigned and resulted in proportions comparable to those from studies in patients who did not receive neo-adjuvant chemotherapy (Figure 2A). This implies that it is possible to reliably assign tumors to IntClust subtypes based on biopsy material alone as might be undertaken in clinical practice. Overall, similar proportions of each of the 10 subtypes were found in external studies in comparison to the METABRIC reference study (Figure 2B). Moreover, the relative composition of each IntClust subtype in terms of the proportion of different ‘intrinsic’ subtypes that comprised it was very similar between the METABRIC study and external samples (Figure 2B). The inverse of the plot in Figure 2B, depicting the IntClust subtype composition of each of the ‘intrinsic’ subtypes classified according to PAM50 and SCMGENE is presented in Additional file 9.
In order to evaluate the relative contribution of each classifier to the prediction of relapse-free survival, we compared the discrimination of survival prediction models. These models comprised the molecular (SCMGENE, PAM50, IntClust) subtypes as categorical variables and were adjusted for tumor size (<1, 1 to 2, 2 to 3, 3 to 5, >5 cm), node status (negative versus positive) and histological grade (1, 2 or 3). The coefficients for these models were derived using Cox regression in the METABRIC dataset and then applied to external studies with available data in order to avoid over-optimistic estimates. Harrell’s C-index was used to estimate the relative discrimination of models where an index of 1 reflects perfect discrimination between high and low risk patients while an index of 0.5 reflects discrimination which is no better than chance. We conducted analyses separately by ER status and within three brackets of follow-up time (0 to 4, 4 to 8 and 8 to 15 years) in order to account for violations of the Cox proportional hazards assumption and to estimate differences in model performance for short- versus long-term survival prediction. Additional file 11 depicts the results of these analyses. In general, the performance of all three models was significantly better in ER-positive breast cancer, particularly during the first 5 years of follow-up, compared with ER-negative disease. The relative performance of the three models was comparable in both ER-positive and ER-negative breast cancer. Both IntClust (P = 0.005) and SCMGENE (P = 0.03) significantly outperformed PAM50 in the prediction of late events (8 to 15 years) in ER-positive breast cancer (Additional file 11). However, it should be noted that, particularly for late events (81 events in ER-positive disease), these analyses may be underpowered and, as a consequence, preclude robust conclusions being drawn.
These analyses show that the IntClust classifier performs at least as well as transcriptome-based classification in the prediction of relapse-free survival.
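As a minimal illustration of how discrimination can be measured when coefficients are fixed in one dataset and applied to another, the Python sketch below computes a linear predictor from pre-specified coefficients and then Harrell's C-index on right-censored data. It is not the code used in the study (which relied on Stata); the coefficient values, covariate names and patients are hypothetical.

```python
def linear_predictor(coefs, patient):
    """Risk score from coefficients estimated elsewhere (no refitting)."""
    return sum(beta * patient.get(name, 0.0) for name, beta in coefs.items())


def harrell_c(times, events, risk):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when patient i had an event before the
    follow-up time of patient j. It is concordant when patient i also has
    the higher predicted risk; ties in risk count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical coefficients, standing in for those fit in a training cohort.
coefs = {"subtype_IC2": 0.9, "node_positive": 0.6, "grade": 0.3}

# Hypothetical external patients: covariates, follow-up (years), event flag.
patients = [
    {"subtype_IC2": 1, "node_positive": 1, "grade": 3, "time": 1.5, "event": 1},
    {"subtype_IC2": 0, "node_positive": 1, "grade": 2, "time": 3.0, "event": 1},
    {"subtype_IC2": 0, "node_positive": 0, "grade": 3, "time": 2.0, "event": 0},
    {"subtype_IC2": 0, "node_positive": 1, "grade": 3, "time": 6.0, "event": 0},
]

risk = [linear_predictor(coefs, p) for p in patients]
c_index = harrell_c([p["time"] for p in patients],
                    [p["event"] for p in patients],
                    risk)
print(c_index)
```

Censored patients contribute only as the later member of a pair, which is why an analysis with few events in a given follow-up bracket rests on few comparable pairs.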
A second determinant of the relative utility of a disease classification scheme is whether differences in chemosensitivity are reflected in different subtypes. In order to investigate this, we used a collection of breast cancer studies where patients had received neo-adjuvant cytotoxic chemotherapy [26-29] and for whom data on pathological complete response (pCR) were available (N = 871). A tumor is said to have undergone pCR if, following surgery, no residual tumor cells remain upon pathological examination. pCR has been shown to be a powerful predictor of long-term survival. Distinct patterns of pCR between molecular subtypes of breast cancer have been reported previously, with the highest rates observed in ER-negative tumors and the lowest in ER-positive HER2-negative tumors. Similarly, distinct patterns of pCR were observed by molecular subtype (Figure 3C). The highest rates of pCR by IntClust subtyping were observed within the IntClust 10 subtype at 37% (45/121) compared with the highest rate by PAM50 classification within the basal-like subtype at 31% (101/322) and the highest rate by SCMGENE classification within the ER-/HER2- subtype at 27% (125/463). The lowest rates of pCR by IntClust subtyping were observed within the IntClust 2 subtype at 0% (0/20) compared with the lowest rate by PAM50 classification within the luminal A subtype at 6% (15/265) and the lowest rate by SCMGENE classification within the ER+/HER2-, low proliferation subtype at 8% (4/51). We next conducted a formal comparison of the relative value of each classifier in predicting pCR after adjustment for clinical variables (tumor and lymph node stage and histological grade). We evaluated the discrimination of prediction models using the area under the curve (AUC) from a receiver operating characteristic (ROC) analysis.
Odds ratios were based on a logistic regression model, again derived from the largest external study (N = 435) and subsequently tested in the remaining data (N = 436) in order to avoid over-optimistic estimates. The performance of the three models was very similar and not significantly different, with SCMGENE classification returning an AUC of 0.64 (95% confidence interval (CI) 0.56 to 0.72), PAM50 classification returning an AUC of 0.67 (95% CI 0.60 to 0.75) and the IntClust classifier returning an AUC of 0.66 (95% CI 0.58 to 0.74) (Additional file 11). These data show that IntClust is as accurate a predictor of pCR to cytotoxic chemotherapy as PAM50 or SCMGENE classification.
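For readers less familiar with the AUC, the following Python sketch shows one standard way to compute it: as the proportion of responder/non-responder pairs in which the responder receives the higher predicted probability. This is an illustration under stated assumptions, not the Stata routine used in the study; the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for pCR, 0 for residual disease.
    scores: predicted probabilities (or any monotone risk score).
    Ties between a responder and a non-responder count as one half.
    """
    responders = [s for y, s in zip(labels, scores) if y == 1]
    non_responders = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for r in responders:
        for q in non_responders:
            if r > q:
                wins += 1.0
            elif r == q:
                wins += 0.5
    return wins / (len(responders) * len(non_responders))


# Hypothetical test-set predictions from a model fit in a separate study.
labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
print(roc_auc(labels, scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

An AUC of 0.5 corresponds to random ranking, so values in the range of 0.64 to 0.67 indicate modest discrimination for all three classifiers.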
The landscape of somatic alterations in breast cancer is complex and heterogeneous. This variety is reflected in the diverse clinical behavior of breast tumors and provides critical insight for the development of rational therapies. Therefore, a method for capturing this complexity which can be readily implemented in a clinical setting is urgently required. We have extensively investigated the potential of the IntClust classification to meet this need, in terms of its reproducibility, association with clinical outcome and representation of copy number-driven cancer genes. We find that IntClust subtypes are observable across studies, are significantly associated with clinical outcome and best capture the repertoire of breast cancer genomic drivers. These data provide a compelling rationale for IntClust as a driver-based molecular taxonomy with considerable potential for clinical application. Indeed, a recent clinical trial (SAFIR01) shows that CNAs are the drivers for which targeted therapies are most frequently identified in breast cancer.
IntClust subtypes were observed across studies at comparable frequencies. This important observation demonstrates that these entities are reproducible and represent true breast tumor subtypes. The discovery study used for identifying the IntClust groups comprised 997 tumors from five centers spanning two continents. This approach was adopted in order to accrue a sufficient sample size representative of the whole of the breast cancer population. Therefore, a robust classifier of IntClust subtypes should identify these groups in external studies, just as we have observed. We also note that TP53, one of the two most frequently mutated genes in breast cancer, is mutated at comparable frequencies across IntClust subtypes in both METABRIC and TCGA.
The clinical validity of the IntClust subtypes has here been demonstrated by their association with relapse-free survival and propensity to undergo pCR in studies of neo-adjuvant chemotherapy. An important observation was the recapitulation of survival patterns originally observed in the METABRIC study. This shows that the IntClust subtypes are biologically distinct, readily discernible entities associated with widely variable but predictable clinical behavior. We compared the performance of prediction models which contained either transcriptome-based or IntClust subtypes in their ability to discriminate between patients at higher versus lower risk of disease relapse or resistance to chemotherapy. These models performed similarly. Since the IntClust subtypes were conceived with the intention of best representing breast tumor biology as defined by the genome, survival was not taken into account. It should, however, be noted that an association with survival is not the sole arbiter of the validity of a biological classification. Data-driven approaches designed to generate models for risk stratification of breast cancer patients have largely uncovered proliferation-related genes which, while they are indisputably effective predictors of survival, do not provide additional insight into the biology underlying their expression. Equally, an example of an important disease entity which does not significantly improve prediction of survival is lobular breast carcinoma. Patients with these tumors, which are characterized by single-file morphology and loss of E-cadherin expression, have been convincingly shown to experience patterns of survival indistinguishable from patients with the more common invasive ductal carcinoma, yet the diagnosis of lobular carcinoma is routine, critical for appropriate long-term clinical management and highlights a patient subgroup potentially amenable to novel targeted therapies.
A comparable example concerns the distinction between IntClust 2 and IntClust 1. IntClust 2 tumors are characterized by amplification of 11q13/14 encompassing CCND1, EMSY and PAK1, whereas IntClust 1 tumors harbor 17q23 amplification encompassing RPS6KB1, PPM1D, PTRH2 and APPBP2. Both subgroups comprise high-risk, mostly ER-positive tumors. The unadjusted 10-year relapse-free survival observed in external studies was 64% for patients with IntClust 1 tumors and 49% for patients with IntClust 2 tumors. However, no tumors in the IntClust 2 subtype underwent pCR (0/20) whereas tumors in the IntClust 1 subtype showed the fourth highest rates of pCR at 20% (15/76). Although these observations require validation, they suggest that in spite of a similar aggressive clinical course, IntClust 2 tumors are chemoresistant in comparison to IntClust 1 tumors. This difference, highlighted by IntClust subtyping and likely attributable to differences in amplification-driven oncogenes, is worthy of further investigation. Here, IntClust 2 tumors represented just 3.1% (298/9,524) of patients; nonetheless, this group experiences some of the poorest survival of all subgroups. This dismal prognosis may, in part, be explained by our observation that IntClust 2 tumors are entirely chemoresistant. These patients warrant consideration of alternative therapeutic modalities and represent a priority for the development of novel targeted therapies. This subtype is not identified by any other breast cancer classification scheme. Such observations highlight the important benefits of rational tumor classification based on molecular drivers.
Based on an independent list of recurrent CNAs in breast cancer and using samples compiled from external studies, we found that the IntClust classification best explains expression levels of genes which fall within these loci. This finding reiterates the nature of IntClust as a biological classification which explains characteristic gene expression profiles in terms of their genomic drivers. We have conducted an unbiased comparison by including all genes that fall within loci reported as recurrently altered by an independent group (TCGA); however, it should be noted that the magnitude of explained variation differed greatly between genes (Additional file 16). The explained variation of a large proportion of genes showed little difference between classifiers whereas a subset showed large differences (Additional file 16). This is likely due to the fact that the majority of genes included within these loci are passengers which do not confer a growth advantage to proliferating tumor cells. Somatic CNAs are a relatively common event among breast cancer genomes and a long-standing problem has been to identify genes which amount to drivers within recurrently altered genomic loci. Although criteria for their characterization have been proposed, particularly for amplified genes, they stipulate multiple lines of independent evidence which require considerable resources and, as such, have not been generated for most loci. Moreover, it is possible that in some instances where a minimal region of amplification contains more than one gene, such as the 11q13/14 locus which defines IntClust 2, adjacent genes may act in a concerted manner to confer a selective growth advantage, just as has been observed in lung cancer. The conception of IntClust was pragmatic in attempting to minimize the influence of passenger genes. Three strategies were employed to this end. First, the discovery study was large (997 samples), enabling reliable identification of regions of recurrent CNA.
Second, only the top 1,000 cis eQTLs were included for classification based on the strength of association between alteration in copy number and levels of gene expression. Third, clustering retained only those features which contributed to the separation of tumors into distinct subgroups (754 features). This approach provides the most definitive scheme for breast tumor classification based on the pattern of copy number-driven genes. It is likely, therefore, that our unbiased comparison of explained variation in the expression of genes within recurrent CNAs underestimates the extent to which IntClust reflects the expression of genomic drivers within these regions. Nonetheless, our analysis does demonstrate that IntClust best captures variation in levels of gene expression of copy number-driven breast cancer genes.
We have developed an expression-based method for classification of breast tumors into the IntClust subtypes. We used this method and public datasets of breast tumor transcriptomes to investigate the validity of IntClust. We confirmed that the IntClust subtypes are reproducible entities, demonstrated their association with clinical outcome and found that IntClust best captures expression patterns of breast cancer drivers. Our method is a powerful tool for independent researchers to investigate the significance of IntClusters. Moreover, our findings highlight the potential of IntClust in the era of targeted therapies. Our classifier lays the foundation for the generation of a clinical test to assign tumors to IntClust subtypes.
We modified the method for IntClust classification which was originally reported for subtype validation. Probes were re-annotated to hg19 and some eliminated because of ambiguous genomic matching (where a probe sequence matched to more than one position in the reference genome). Some genes were represented by more than one probe, reflecting the design of the Illumina beadarray ht12v3 microarray, in which probes can represent different parts of a gene. Our method followed three steps in classifying a new set of samples. In the first step features were matched. Copy number features were matched either by genomic position or gene name, while expression features were matched by probe name (METABRIC study) or gene name. This was performed by the function matchFeatures. In the second step data were normalized to the distribution of the METABRIC discovery set. We scaled each gene to a z-score. This was achieved using the normalizeFeatures function. The function also implements other normalization methods from the CONOR R package. In the third step a classifier based on shrunken centroids was trained on the matched probes using the pamr R package. The optimal threshold was chosen by cross-validation, so different runs produced slightly different classifications unless we set a random number seed. That is, centroids were re-estimated based on the features available in different platforms against the METABRIC discovery dataset for each of the 10 clusters. The iC10 function was used for this step.
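The three steps can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the iC10 or pamr implementation: the helper names, genes, centroid values and threshold are hypothetical; centroids are shrunk toward zero by plain soft-thresholding (reasonable only because the data are z-scored); and samples are assigned to the nearest centroid by squared Euclidean distance, without the within-class standard deviations, class priors or cross-validated threshold used by pamr.

```python
from statistics import mean, pstdev


def match_features(training_genes, new_genes):
    """Step 1: keep the training features that the new platform also measures."""
    available = set(new_genes)
    return [g for g in training_genes if g in available]


def zscore_by_gene(samples, genes):
    """Step 2: scale each gene to mean 0 and unit variance across samples."""
    scaled = [dict() for _ in samples]
    for g in genes:
        values = [s[g] for s in samples]
        mu, sd = mean(values), pstdev(values)
        for out, v in zip(scaled, values):
            out[g] = (v - mu) / sd if sd > 0 else 0.0
    return scaled


def soft_threshold(x, delta):
    """Move x toward zero by delta; values within delta of zero become 0."""
    magnitude = max(abs(x) - delta, 0.0)
    return magnitude if x >= 0 else -magnitude


def shrink_centroids(centroids, delta):
    """Shrink every centroid component toward the overall (zero) centroid."""
    return {k: {g: soft_threshold(v, delta) for g, v in c.items()}
            for k, c in centroids.items()}


def classify(sample, centroids, genes):
    """Step 3: assign the subtype whose centroid is closest on matched genes."""
    def distance(c):
        return sum((sample[g] - c[g]) ** 2 for g in genes)
    return min(centroids, key=lambda k: distance(centroids[k]))


# Hypothetical training centroids on the z-score scale (two classes, not ten).
training_centroids = {
    "A": {"ESR1": 1.0, "ERBB2": -0.5, "MYC": 0.1},
    "B": {"ESR1": -1.0, "ERBB2": 1.2, "MYC": -0.05},
}
# Hypothetical new cohort measured on a platform lacking MYC.
cohort = [
    {"ESR1": 10.0, "ERBB2": 5.0, "GATA3": 7.0},
    {"ESR1": 4.0, "ERBB2": 9.0, "GATA3": 3.0},
    {"ESR1": 9.0, "ERBB2": 5.5, "GATA3": 6.5},
    {"ESR1": 5.0, "ERBB2": 8.5, "GATA3": 4.0},
]

matched = match_features(["ESR1", "ERBB2", "MYC"], cohort[0].keys())
scaled = zscore_by_gene(cohort, matched)
shrunken = shrink_centroids(training_centroids, delta=0.2)
calls = [classify(s, shrunken, matched) for s in scaled]
print(matched, calls)
```

Restricting the distance to the matched genes mirrors the idea of re-estimating centroids on whichever features a platform provides, at the cost of less information when few features overlap.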
Several quality statistics were included as part of our method for inspection of results. A goodness of fit, which was a Pearson correlation coefficient, was computed. It represented the correlation between the average (across all samples) gene expression profile for each cohort and the centroids from the training data set, within each IntClust subtype for those genes where data were available in the external study. In short, the statistic represents a measure of the similarity, in terms of gene expression, of IntClust subtypes from external studies compared with the training data set. We plotted centroids in order to inspect their representation within each subtype in the test dataset - several functions are included in the iC10 package to achieve this. We have made our method freely available for download as an R package under the name 'iC10' at CRAN.
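A per-subtype version of this goodness-of-fit statistic might be computed as in the Python sketch below. It follows the verbal description above rather than the iC10 source code, so details such as how subtypes are pooled into a single study-level figure are not reproduced; all names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def goodness_of_fit(samples, calls, centroids, genes):
    """Correlate each subtype's average profile with its training centroid."""
    fit = {}
    for subtype, centroid in centroids.items():
        members = [s for s, c in zip(samples, calls) if c == subtype]
        if not members:
            continue  # subtype not observed in this cohort
        average = [sum(m[g] for m in members) / len(members) for g in genes]
        fit[subtype] = pearson(average, [centroid[g] for g in genes])
    return fit


genes = ["g1", "g2", "g3"]
centroids = {
    "A": {"g1": 1.0, "g2": 0.0, "g3": -1.0},
    "B": {"g1": -1.0, "g2": 0.5, "g3": 1.0},
}
samples = [
    {"g1": 2.0, "g2": 0.0, "g3": -2.0},
    {"g1": 1.0, "g2": 0.2, "g3": -1.2},
    {"g1": 1.0, "g2": 0.0, "g3": -1.0},  # resembles A but is labeled B below
]
calls = ["A", "A", "B"]

fit = goodness_of_fit(samples, calls, centroids, genes)
print(fit)
```

In this toy example the deliberately mislabeled third sample yields a negative correlation for subtype B, the kind of signal that would prompt closer inspection of a study.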
We applied this method to breast cancer gene-expression datasets available in public repositories. A large proportion of these studies had previously been compiled and curated by Haibe-Kains et al. and we downloaded these data directly from the authors’ website. Additional details, including Gene Expression Omnibus (GEO) accession numbers of included studies, are provided in Additional file 2. It is possible that data for some patients have been uploaded more than once, particularly if those patients participated in more than one study. We took three precautions against inadvertent inclusion of replicate records in our analyses: 1) only cases with a unique GEO identifier were retained; 2) cases identified by Haibe-Kains et al. as replicates were removed; and 3) cases identified by the doppelgangR package as replicates based on highly correlated gene expression profiles were further investigated. Those cases which, in addition to correlated gene expression, also showed concordant values for tumor stage, node stage, histological grade and, in the case of neo-adjuvant studies, pCR were also removed. Cases identified as probable replicates by this strategy almost exactly overlapped with those annotated as replicates by Haibe-Kains et al. with only an additional three cases being removed. For each dataset, the iC10 package was run with expression data only (using probe names for the METABRIC study and gene names for the rest) and normalizing each probe to a z-score ('scale' option in the function normalizeFeatures). PAM50 classification was conducted accounting for imbalances in ER status, as previously defined. SCMGENE classification was conducted using the genefu package in R, available at Bioconductor.
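The third precaution, flagging pairs with highly correlated expression and then requiring concordant clinical annotation, can be illustrated with the Python sketch below. It does not reproduce the doppelgangR algorithm; the correlation threshold of 0.99, the field names and the records are all hypothetical.

```python
from itertools import combinations
from math import sqrt

CLINICAL_FIELDS = ("t_stage", "n_stage", "grade")  # hypothetical field names


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def probable_replicates(records, threshold=0.99):
    """Pairs with near-identical expression AND matching clinical values."""
    flagged = []
    for a, b in combinations(records, 2):
        if pearson(a["expr"], b["expr"]) < threshold:
            continue
        if all(a[f] == b[f] for f in CLINICAL_FIELDS):
            flagged.append((a["id"], b["id"]))
    return flagged


records = [
    {"id": "s1", "expr": [1.0, 2.0, 3.0, 4.0], "t_stage": 2, "n_stage": 1, "grade": 3},
    {"id": "s2", "expr": [1.1, 2.0, 3.1, 3.9], "t_stage": 2, "n_stage": 1, "grade": 3},
    {"id": "s3", "expr": [4.0, 1.0, 3.0, 2.0], "t_stage": 2, "n_stage": 1, "grade": 3},
    {"id": "s4", "expr": [1.0, 2.1, 2.9, 4.0], "t_stage": 2, "n_stage": 1, "grade": 2},
]

print(probable_replicates(records))
```

Here only the first two records are flagged: the fourth is highly correlated with them but differs in grade, so it is retained, which reflects the conservative two-part criterion described above.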
In order to classify breast cancer cell lines, we used copy number and gene expression data from two collections of cell lines: Sanger COSMIC database and CCLE. Copy number data from the COSMIC database consisted of segmented copy number calls. The CCLE database provided copy number data on 579 genes (optimal IntClust classification requires 612 genes) as the summarized log ratio for each gene. Nevertheless, the fit of the IntClust classifier based on copy number was similar for both datasets (0.74 for COSMIC and 0.75 for CCLE). We noted that some cell lines are characterized by copy number amplification of both ERBB2 (IntClust 5) and 8q24 (IntClust 9), which contains the MYC oncogene. In these cases the classifier mostly assigned an IntClust 9 subtype (HCC1419, HCC1569, MDA-MB-453, OCUB-M, ZR-75-30). As a comparison, 10% (28/268) of primary tumors with amplification of ERBB2 also showed co-amplification of MYC in 1,980 samples from the METABRIC study. Cell lines were also classified in IntClust subtypes based on gene expression alone and combined copy number/gene expression and into PAM50 and SCMGENE subtypes based on gene expression alone.
Associations between subtype and survival were estimated using Cox regression. Of the studies with available time-to-event data, relapse-free survival was available for some and distant metastasis-free survival for others. Our survival time variable comprised relapse-free survival but where this was unavailable distant metastasis-free survival was used.
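The construction of the combined survival variable can be expressed as a simple fallback rule. The Python sketch below is illustrative only; the field names are hypothetical and missing values are assumed to be encoded as None.

```python
def survival_endpoint(record):
    """Return (time, event, endpoint), preferring relapse-free survival.

    Falls back to distant metastasis-free survival when relapse-free
    survival is unavailable; returns None when neither is recorded.
    """
    for prefix, label in (("rfs", "RFS"), ("dmfs", "DMFS")):
        time = record.get(prefix + "_time")
        event = record.get(prefix + "_event")
        if time is not None and event is not None:
            return time, event, label
    return None


# Hypothetical records from three studies with different available endpoints.
cases = [
    {"rfs_time": 5.2, "rfs_event": 1, "dmfs_time": 6.0, "dmfs_event": 0},
    {"rfs_time": None, "rfs_event": None, "dmfs_time": 8.1, "dmfs_event": 0},
    {"grade": 2},
]
endpoints = [survival_endpoint(c) for c in cases]
print(endpoints)
```

Keeping the endpoint label alongside each record makes it straightforward to check later whether results differ between studies contributing relapse-free and distant metastasis-free survival.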
Comparison of univariable hazard ratios associated with IntClust subtype between the METABRIC (disease-specific survival) and external studies (relapse-free survival) (Figure 3B) was conducted by using IntClust 3 as the referent class, separately for three brackets of follow-up time (0 to 4, 4 to 8 and 8 to 15 years).
Performance of predictive models was assessed as follows: Cox regression models which contained either PAM50 or IntClust as a categorical variable and were adjusted for tumor size (<1, 1 to 2, 2 to 3, 3 to 5, >5 cm), node status (negative versus positive) and histological grade (1, 2 or 3) as continuous variables were fit within the METABRIC study (the largest study) against available time-to-event data (disease-specific survival). These models were stratified by each of the five centers of the METABRIC consortium. Separate models were fit for ER-positive and ER-negative breast cancer within three time brackets (0 to 4, 4 to 8 and 8 to 15 years) in order to investigate differences in model performance in short- versus long-term survival and to account for violations of the proportional hazards assumption. The coefficients derived from these models were then applied to external studies with available data. Comparison of model discrimination in this test population was conducted using the method suggested by Newson, based on Harrell’s C-index and implemented with the somersd and lincom commands in Stata.
Associations between subtype and pCR were estimated using logistic regression. Logistic regression models comprising either PAM50 or IntClust as categorical variables and adjusted for tumor size (T-stage), positive lymph nodes (N-stage) and histological grade were fit in the largest study of neo-adjuvant chemotherapy. Coefficients derived from these models were then applied to the remaining test data. Model discrimination in the test data was estimated using the AUC from a ROC analysis. These analyses were conducted using Intercooled Stata version 11.2 (Stata Corp, College Station, Texas, USA).
For each gene in each list of amplified and deleted genes, we fitted an ANOVA linear model relating the expression of that gene to the IntClust groups or the PAM50 groups. We measured the goodness of fit of these two models using the adjusted R-squared - a measure that accounts for differences in degrees of freedom of the two models when the models have been completely pre-specified. We computed the differences in adjusted R-squared for each gene and averaged them for each gene list. CIs were obtained using 1,000 bootstrap replicates with the percentile method implemented in the package boot. An overall mean for all studies was computed weighting each study by its size. These analyses were conducted using R version 3.1.0.
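The comparison of explained variation can be sketched in Python as follows. This is an illustration rather than the R code used in the study: it computes the adjusted R-squared of a one-way ANOVA model, a percentile bootstrap interval for a mean difference, and a size-weighted mean across studies. It assumes that the bootstrap resamples the per-gene differences, which the text does not specify; all data, names and the seed are hypothetical.

```python
import random


def adjusted_r2(values, groups):
    """Adjusted R-squared of a one-way ANOVA of values on group labels."""
    n = len(values)
    grand_mean = sum(values) / n
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    k = len(by_group)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_resid = sum((v - sum(vs) / len(vs)) ** 2
                   for vs in by_group.values() for v in vs)
    r2 = 1.0 - ss_resid / ss_total
    # k groups use k - 1 model degrees of freedom, leaving n - k residual.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


def percentile_ci(diffs, n_boot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for the mean of per-gene differences."""
    rng = random.Random(seed)
    means = sorted(sum(rng.choices(diffs, k=len(diffs))) / len(diffs)
                   for _ in range(n_boot))
    lower = means[int(round(n_boot * alpha / 2))]
    upper = means[int(round(n_boot * (1 - alpha / 2))) - 1]
    return lower, upper


def weighted_mean(estimates, sizes):
    """Overall mean across studies, weighting each study by its size."""
    return sum(e * s for e, s in zip(estimates, sizes)) / sum(sizes)


# Hypothetical expression of one gene under a finer and a coarser grouping.
expression = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9, 5.0, 5.2, 4.9]
fine = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
coarse = ["x", "x", "x", "x", "x", "y", "y", "y", "y"]
difference = adjusted_r2(expression, fine) - adjusted_r2(expression, coarse)

# Hypothetical per-gene differences for one gene list in one study.
per_gene_diffs = [0.10, 0.20, 0.05, 0.30, 0.15]
print(difference, percentile_ci(per_gene_diffs))
print(weighted_mean([0.12, 0.20], [100, 300]))
```

The adjustment penalizes the ten-group IntClust model more heavily than the five-group PAM50 model, so a positive difference cannot be attributed solely to the larger number of classes.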
Annotated R and Stata code used to generate the reported analyses is provided as Additional file 17.
Data from the METABRIC study are deposited in the European Genome-phenome Archive, from which they can be downloaded. The IDs for expression are: EGAD00010000210 (discovery) and EGAD00010000211 (validation). The IDs for copy number are: EGAD00010000213 (discovery) and EGAD00010000215 (validation). Details of data sources, including accession codes for all other studies, are provided in Additional file 2.
ANOVA: analysis of variance
AUC: area under the curve
CCLE: Cancer Cell Lines Encyclopedia
CNA: copy number aberration
eQTL: expression quantitative trait locus
GEO: Gene Expression Omnibus
HER2: human epidermal growth factor receptor 2
PAM: Prediction Analysis of Microarrays
pCR: pathological complete response
ROC: receiver operating characteristic
We acknowledge funding from Cancer Research UK and the National Institute for Health Research (NIHR) funded Cambridge Biomedical Research Centre and Experimental Cancer Medicine Centre, Cambridge. HRA is an Academic Clinical Lecturer funded by the NIHR and supported by a Career Development Fellowship from the Pathological Society of Great Britain and Northern Ireland. We thank Roslin Russell and Alejandra Bruna for their useful comments on the study. The results here are in part based upon data generated by the TCGA Research Network.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.