 METHOD
 Open Access
 Published:
Cancer progression modeling using static sample data
Genome Biologyvolume 15, Article number: 440 (2014)
Abstract
As molecular profiling data continue to accumulate, the design of integrative computational analyses that can provide insights into the dynamic aspects of cancer progression becomes feasible. Here, we present a novel computational method for the construction of cancer progression models based on the analysis of static tumor samples. We demonstrate the reliability of the method with simulated data, and describe the application to breast cancer data. Our findings support a linear, branching model for breast cancer progression. An interactive model facilitates the identification of key molecular events in the advance of disease to malignancy.
Background
Human cancer is a dynamic disease that develops over an extended time period through the accumulation of a series of genetic alterations. Once initiated, the advance to malignancy can to some extent be considered a Darwinian process (a multistep evolutionary process) that responds to selective pressure [1][5]. While the majority of genetic alterations confer no growth advantage, tumor cells that acquire mutations in genes that control key cellular processes can overwhelm less vigorous cell populations within the tumor mass, and this process continues through a series of clonal expansions that result in tumor persistence and growth, and ultimately the ability to invade surrounding tissues and metastasize to distant organs. The delineation of this dynamic process and the identification of pivotal molecular events that drive stepwise progression to malignancy would provide a critical foundation and guide for the development of cancer diagnostics, prognostics and targeted therapeutics.
The assembly of timeseries data collected through repeated sampling across an entire disease process would provide essential information for the elucidation of system dynamics and diseaseassociated genetic regulation [6], and recent advances in genomic technologies has made it possible to study cancer genomes at a scale and cost that allow the design of such studies [7],[8]. However, due to the need for immediate treatment upon diagnosis, it is ethically infeasible to collect timeseries data to study disease progression. An alternative is animal studies where timeseries data could be obtained, but the extrapolation of animal data to human disease is not always appropriate. Moreover, the development of spontaneous human cancer is a process that takes several years, and to overcome the heterogeneous nature of the disease and to minimize random effects, a very large number of samples is required, so animal studies are not economically or logistically viable in this context. Conversely, due to the high incident rate of cancer, and the clinical care protocols in place in developed countries, a huge number of archived tumor specimens is available. For example, it is estimated that about 250,000 new cases of breast cancer will be diagnosed in 2014 in the US [9]. Assuming that cancer cells are derived from normal cells, and that a static sample can be regarded as a snapshot of the dynamic cancer process, then we can pose the following question: Is it possible to construct a cancer progression model using data acquired from static samples?
In this paper, we present a proofofprinciple population study to address the above question. Figure 1 depicts a flowchart of the presented stepwise study. It involves extensive work on algorithm development, computational simulation, disease model construction and validation. By analysis of publicly available data, we demonstrated that through the application of advanced computational techniques it is indeed possible to construct a cancer progression path using massive data obtained from static sampling. Analysis of molecular data from 2,133 breast samples [7] enabled the visualization of highdimensional data structures that provided a framework for the extrapolation of progressive disease associations and trends across tumor samples. A very similar data structure was revealed through the analysis of an independent dataset of 524 samples [8], and mapping of copy number alteration (CNA) frequencies, somatic mutation rates, tumor grade information, and the expression levels of specific key genes supported the validity of the model. Our analysis, for the first time, showed that while breast cancer is a genetically and clinically heterogeneous disease, tumor samples are distributed on a lowdimensional data structure manifold, suggesting that genotypes are not hardwired and can shift over time.
The study provides a framework for the construction of highresolution cancer progression models that can combine all currently available genetic information. Through the visualization of key events in tumor progression, such models can facilitate the identification of pivotal driver genes, improved prognostic stratification systems and potential points of susceptibility for therapeutic intervention. Although in this paper we focus on breast cancer, the developed data analysis strategy is equally applicable to model other cancers and other human diseases where the lack of timeseries data to study system dynamics is a ubiquitous problem.
Results and discussion
Bioinformatics pipeline for cancer progression modeling
A cancer progression path can be mathematically viewed as a complex manifold with multiple branches embedded in a highdimensional genomic space. Our working hypothesis is as follows: as tumors progress toward advanced stages driven by genetic mutations, each static sample leaves a genetic footprint on the path. If the number of samples is sufficiently large, the footprint information collectively would enable us to recover the progression path and thereby to reveal key genetic events that drive the dynamic evolution of cancer to malignancy. To verify the hypothesis, we developed a novel data analysis strategy to analyze and integrate molecular profile data into a model of cancer progression, which is detailed below.
Supervised learning approach to identifying cancer progression related genes
Since it is likely that only a small fraction of genes are involved in the malignant process, the first step towards constructing a model is to identify genes associated with cancer progression. This problem has been extensively studied in the community in different contexts [10][12]. One of the most commonly used approaches is correlation analysis [13],[14]. A gene with its expression levels highly correlated with survival time is likely to play a role in cancer development. However, correlation analysis can only find genes with a linear dependency with survival time. Moreover, by analyzing one gene at a time, it ignores possible interactions among genes. In molecular classification, a commonly used strategy is first to partition patients into a bad or good prognostic group at a predefined end point (usually 5year survival or time to metastasis) and then perform feature selection for classification analysis [10][12],[15],[16]. A major drawback to that approach is that patients with survival times slightly longer or shorter than the end point are put into two different groups. To explore the magnitude information of response variables, we proposed to select relevant genes within a regression framework. The idea has been widely used in the statistics and oncology communities. There are a number of excellent algorithms exemplified by Lasso [17] and its variants, which operate under a linearmodel assumption; however, the dependency between gene expression changes and disease progression is complex and is unlikely to be linear [18]. It is thus more appropriate to formulate it as a featureselection problem for nonlinear regression. However, due to the difficulties of mathematically describing complex data structures, feature selection for nonlinear regression remains a challenging problem.
We developed a new method to address the aforementioned challenge. It is a natural extension of our previous work on feature selection for highdimensional classification problems [19],[20]. The basic idea is to decompose a nonlinear regression problem into a set of linear classification problems and learn feature relevance within a largemargin framework. The development of the algorithm is based on a key observation that a classification problem can be viewed as a degenerated regression problem. To see this, let $D={\left\{\right({\mathbf{\text{x}}}_{n},{y}_{n}\left)\right\}}_{n=1}^{N}{\mathcal{R}}^{J}\times \mathcal{R}$ be a set of training data, where x _{ n } is the nth sample, y _{ n } is the corresponding survival time, and J»N. Without loss of generality, we assume that y _{ n }»y _{ m } if n>m. If we divide the dataset into two subsets at a predefined end point (e.g., 5 years), we end up with a molecular classification problem. As noted above, a major issue with molecular classification is that it ignores the magnitude information of survival times. Then, a natural idea is to partition the data into two parts at multiple end points $S={\left\{{s}_{k}\right\}}_{k=1}^{K}$. A possible choice is s _{ n }=(y _{n»1}+y _{ n })/2 for 2≤n≤N. We seek to select a feature subset so that the overall classification error of unseen test data computed at all end points is minimized. We give below a detailed mathematical description of the proposed method. For the ease of presentation, at the moment, we only consider sample x _{ n } and divide the data into two parts D _{1}={x _{ j }y _{ j }<s _{ k },1≤j≤N} and D _{2}={x _{ j }y _{ j }>s _{ k },1≤j≤N} at end point s _{ k }. We further assume that y _{ n }>s _{ k }. We start by defining a margin for x _{ n }. Given a distance function, we find two nearest neighbors for x _{ n } from D _{1} and D _{2}, denoted as NN(D _{1}) and NN(D _{2}), respectively. The margin of x _{ n } is then defined as ≤ _{ n }(s _{ k })=d(x _{ n },NN(D _{1}))≤d(x _{ n },NN(D _{2})), where d(·) is a distance function. In this study, we used the block distance to measure the similarity between two samples, which was also used in [19],[21]. The above margin is called the hypothesis margin, and can be interpreted as a measure of how much x _{ n } can be corrupted by noise before being misclassified by a onenearestneighbor classifier [22]. By the large margin theory [23],[24], a learning algorithm that minimizes a marginbased error function usually generalizes well on unseen test data. Then, a natural idea is to scale each feature, and thus obtain a weighted feature space, parameterized by a nonnegative vector w, so that a marginbased error function in the induced feature space is minimized. The magnitude of each element of w reflects the importance of the corresponding feature. The margin of x _{ n }, computed with respect to w, is given by ${\rho}_{n}\left({s}_{k}\right\mathbf{\text{w}})={\mathbf{\text{w}}}^{T}({\mathbf{\text{x}}}_{n}\text{NN}({D}_{1}\left)\right\xb7{\mathbf{\text{x}}}_{n}\text{NN}({D}_{2}\left)\right)\Delta {\mathbf{\text{w}}}^{T}{\mathbf{\text{z}}}_{n}(s)$, where · is an elementwise absolute operator. Note that ρ _{ n }(s _{ k }w) is a linear function of w, and the margin thus defined requires only information about the neighborhood of x _{ n }, while no assumption is made on the underlying data distribution. The above analysis can be repeated for each sample at each end point, leading to the following optimization problem:
where I(·) is the indicator function. Since in the inner summation x _{ n } is held out as a test sample, the above formulation can be interpreted as finding a feature subspace where the overall leaveoneout crossvalidation error is minimized.
The main problem with the above margin definition is that the nearest neighbors of a given sample are unknown before learning. With tens of thousands of irrelevant features, which is the case in this study, the nearest neighbors defined in the original space can be completely different from those in the induced space. A possible way to address this issue, proposed by [19], is to use a probabilistic model where the nearest neighbor of a given sample is treated as hidden variables. Following the principles of the expectationmaximization (EM) algorithm [25], we estimated the margin by computing the expectation of ρ _{ n }(s _{ k }w) via averaging out the hidden variables:
where $\mathbf{E}[\phantom{\rule{0.3em}{0ex}}\xb7]$ is the expectation operator, ${\mathcal{\rho}}_{1}=\{j:{\mathbf{\text{x}}}_{j}\rho {D}_{1}\}$, ${\mathcal{\rho}}_{2}=\{j:{\mathbf{\text{x}}}_{j}\rho {D}_{2}\}$, and Q(jn,w) and P(jn,w) are the probabilities of x _{ j } being the nearest neighbors of x _{ n } in D _{1} and D _{2} with respect to w, respectively. The probabilities Q(jn,w) and P(jn,w) can be estimated using the standard kernel method.
After the margins are defined, the problem of learning feature weights can be solved within the large margin framework [23],[24]. Note that the indicator function is nondifferentiable and nonconvex. A commonly used practice to address this issue is to minimize the upper bound of the cost function. We used the hinge loss, which leads to a support vector machine formulation of feature selection for nonlinear regression:
where we imposed an ℓ _{1} penalty on w to obtain a sparse solution, and ℓ is a regularization parameter controlling the sparseness of a solution, which can be estimated through crossvalidation.
Note that z _{ n } implicitly depends on w through Q(jn,w) and P(jn,w). We used a fixedpoint recursion method to solve w. First, we made a guess for w, which was used to calculate pairwise distances d(x _{ i },x _{ j }) and probabilities Q(jn,w) and P(jn,w). Then, the feature weight vector w was updated by solving the ℓ _{1} regularized optimization problem. The two steps were iterated until convergence. We used our recently developed gradientdescentbased algorithm to solve the above optimization problem efficiently [26]. By using the fixedpoint theory [27], it can be proved that the algorithm converges to a unique solution regardless of the initial weights if the kernel width is properly selected. The detailed mathematical derivations are given in the supplementary data, where we also present a simulation study that demonstrates the effectiveness of the new approach.
Twopronged approach to cancer progression modeling
After cancer progression related genes are selected, the next step is to build a progression model. Both cancer evolution theory and our data visualization analyses (shown below) suggest that a progression path and clonal structures are two sides of the same coin (see Figure two in [1], Figure 2 and Additional file 1: Figure S13). This motivated us to develop a novel twopronged method for modeling the cancer progression process. The basic idea is to perform a clustering analysis to detect genetically homogeneous tumor groups, construct a principal curve to summarize the general trend of data, and finally combine the two results using the principal curve as a backbone to build a model of cancer progression. A toy example is given in Figure 3, which illustrates the proposed method.
Clustering analysis
Clustering analysis has been intensively studied in the machine learning community. For the purpose of this study, we used spectral clustering. Compared to modelbased methods (e.g., Kmeans), spectral clustering does not explicitly impose a model assumption on data distribution, and thus is able to detect clusters of unknown shapes [28]. Moreover, by making use of the spectrum of the similarity matrix of the data, spectral clustering provides an embedded mechanism of dimensionality reduction, making it particularly suitable for this study to cluster highdimensional data.
Spectral clustering starts with the generation of a similarity matrix measuring the relative similarity of each pair of samples in the dataset. To this end, we first constructed a mutual Knearestneighbor (KNN) graph based on the profiles of selected genes and copy numbers [28]. On the resulting graph, each vertex represents a sample, and two vertices are connected if the corresponding samples are among the KNNs of each other. Throughout the paper, K was set to 30. We tried different values of K, and they yielded similar results as determined by the normalized mutual information scores [29]. Based on the constructed KNN graph, a similarity matrix S was generated, where the ijth element S _{ ij }= exp(x _{ i }x _{ j }?^{2}/( _{ i }σ _{ j })) if x _{ i } and x _{ j } were connected and 0 otherwise. Here, ${\sigma}_{i}=\sigma {\mathbf{\text{x}}}_{i}\sigma {\mathbf{x}}_{i}^{\left(K\right)}\sigma $ and ${\mathbf{x}}_{i}^{\left(K\right)}$ is the Kth nearest neighbor of x _{ i } [30]. Then, a normalized Laplacian matrix was computed as L=D ^{1/2}S D ^{1/2}, where D is a diagonal matrix with ${D}_{\mathit{\text{ii}}}=\underset{j=1}{\overset{n}{}}{S}_{\mathit{\text{ij}}}$, and the eigenvectors of the matrix L were computed. The number of clusters was determined using the method proposed in [30]. Briefly, for each possible number of clusters C, a rotation matrix that best aligns the columns of the top C eigenvectors with the canonical coordinate system was calculated, and the optimal number of clusters was determined as the one that resulted in the highest alignment quality score. Let $\mathbf{\text{U}}U{\mathcal{R}}^{N\xb7C}$ be a matrix containing the top C eigenvectors as columns. We formed a new matrix $\mathbf{\text{Y}}C{\mathcal{R}}^{N\times C}$ from U by normalizing the rows to norm one [31]. Finally, the Kmeans analysis was performed to group the samples now represented as the rows of the matrix Y into C clusters.
To identify robust and stable clusters, the technique of resamplingbased consensus clustering [32] was used, where Kmeans clustering was repeated 1,000 times and each time 80% of the samples were drawn randomly without replacement from the entire dataset. The results of the 1,000 runs were then aggregated into a consensus matrix, which gives a visual representation of the frequency of two samples being grouped into the same cluster. To assess the clustering robustness further, the silhouette width of each sample was calculated, which is defined as the difference between its average similarity with samples in the same cluster and highest average similarity with samples in different clusters. A cluster with an average silhouette width >0 is considered stable.
Principal curve
We used a principal curve to describe the general trend of the data. Mathematically, a principal curve is a smooth curve going through the center of a data cloud [33],[34]. A toy example is presented in Figure 3. This concept was first proposed by [33] as a generalization of the first principal component line and required a curve not to intersect itself, which is too restrictive for our application. Presumably, a cancer progression path is a highdimensional manifold with multiple branches. Although a dozen algorithms have been developed in the past two decades, there is currently no method that can be used effectively to extract a selfintersected curve embedded in a highdimensional space (see [35] for an excellent review). For the purpose of this study, we developed a new method that addressed some limitations of existing approaches.
We start by describing a probabilistic model responsible for generating a set of random observations $D={\left\{{\mathbf{\text{x}}}_{n}\right\}}_{n=1}^{N}\in {\mathcal{R}}^{M}$. We assume that each sample is generated from an unknown curve in a twostep process. First, a point μ _{ s } is randomly selected from the curve according to a probability density function p(s), and then a sample x is generated from μ _{ s } corrupted by Gaussian noise $\mathcal{N}\left(\mathbf{\text{x}}\right{\mathit{\mu}}_{s},{\mathit{\mu}}_{s})$, where ∑ _{ s } is a covariance matrix and s takes a value from a set Ω. We herein do not specify any form for Ω to reflect the fact that the curve may contain branching structures. Due to the extremely high data dimensionality (M=1,140), it is numerically unstable to estimate full covariance matrices. We thus used a simple heuristic by assuming ∑ _{ s }=θ ^{2}I, where I is an identity matrix. Although we may lose some resolution, the simplification can lead to a numerically stable estimation and worked well for our study. Note that the data generalization mechanism is similar to that assumed in Gaussian mixture modeling [36], with the latter using a set of discrete data points placed at the centers of clusters, instead of a curve, to represent the data. Therefore, principal curve fitting can be viewed as a natural extension of Gaussian mixture modeling.
Let θ be the parameters {μ _{ s },p(s)}_{sθθ} and θ that we aim to estimate. The loglikelihood function of the data is given by
The parameters can be estimated by maximizing the loglikelihood function using the EM algorithm [25]. A major issue with the maximum likelihood (ML) estimation is that it is an illposed problem and can lead to severe overfitting [34],[36]. Specifically, when ℓ goes to zero, the loglikelihood achieves the maximum value of infinity, resulting in a trivial solution where the obtained curve goes through every data point. One possible remedy for the problem is to add a regularization term to the loglikelihood function to encourage curve smoothness [34]. This strategy, however, works only for simple cases where a curve is not selfintersected. Moreover, it is difficult to determine the regularization parameter used to control the tradeoff between data fitting and curve smoothness. It has been shown that crossvalidation is not a viable method for controlling the complexity of principal curve estimates [37].
We addressed the issue by exploiting the analogy between principal curve fitting and Gaussian mixture modeling discussed previously. Since the estimation of ℓ causes the overfitting problem, we treated ℓ as a userdefined parameter and only performed the ML estimation on ℓ _{ s } and p(s) for a given parameter. Now the problem becomes how to determine the value of ℓ. It can be shown experimentally that with a decrease of ℓ, the curve complexity, which is measured as the total curve length, increases and the data fitting error decreases monotonically (Additional file 2: Figure S9). In our implementation, a continuous curve was discretized into multiple equally spaced segments, and thus the curve length can be directly translated into the number of segments. The problem now becomes how to estimate the optimal number of segments, which is similar to that of estimating the number of clusters in Gaussian mixture modeling. There is a large body of work on this topic in the clustering literature [36],[38],[39]. In this study, we used the elbow method [39] for parameter estimation. Additional file 2: Figure S9 shows the curve fitting error measured as the sum of squared distances between data points and their corresponding closest points on a curve versus the curve length. The fitting error decreases monotonically as the curve length increases, but at a certain point the decrease flattens markedly. To perform the estimation automatically, we fitted a regression model consisting of two lines to the two arms of the elbow curve and estimated the data variance as the one that generated a curve with a length equal to that at the intersection of the two lines (Additional file 2: Figures S9, S10 and S11). A detailed mathematical derivation is given in the supplementary data, where we also present a discussion of some implementation issues and a simulation study that demonstrated the effectiveness of the proposed method.
The developed bioinformatics pipeline also contains some standard statistical techniques, including principal component analysis (PCA), survival data analysis, polynomial curving fitting and hypothesis testing. We refer the reader to [40],[41] for detailed descriptions. The core algorithms of the proposed bioinformatics pipeline are available upon request.
Breast cancer progression modeling
Identification of breast cancer related genes and data visualization
We demonstrated the utility of the developed bioinformatics pipeline by applying it to a largescale breast cancer dataset generated by the METABRIC project [7]. The dataset contains the expression levels of 25,160 genes and copy number data of 30,566 genes from 1,989 tumor samples, and has clinical followup data ranging from 0 to 25 years^{a}. Molecular profiles were obtained from surgically excised primary breast tumors. The majority of estrogen receptor (ER)positive/lymph node (LN)negative patients did not receive chemotherapy, whereas ER /LN+ patients did. Additionally, none of the HER2+ patients received trastuzumab. As such, the treatments were homogeneous with respect to clinically relevant groups. The dataset also contains the gene expression data of 144 normal breast tissue samples. To minimize the confounding factor of censoring in selecting relevant features, we removed 845 samples from the tumorbearing cohort that had clinical followup data of less than 10 years in the featureselection analysis. Gene expression and copy number data of each sample were then stacked together to form a highdimensional vector, and robust linear scaling [42] was performed on each feature so that the 2% and 98% quantiles were set to 0 and 1, respectively. As such, all features were comparable and outlier data were removed. We then applied our new featureselection method (described above) to identify genes associated with cancer progression. Due to the use of an ℓ _{1} regulation on feature weights (see Equation 3), the method offers an embedded mechanism to remove irrelevant and redundant features by assigning them zero weights. In this study, features with weights larger than 10^{3} were selected for the downstream analysis. Using the regularization parameter estimated through tenfold crossvalidation, the analysis identified a total of 1,140 features including 989 genes selected from differential mRNA expression data and 151 genes from copy number data.
To obtain a general picture of data distribution supported by the selected features, we first performed a data visualization analysis of the entire 2,133 samples in the METABRIC dataset using PCA and KNN graphic analysis (Figure 2 and Additional file 1: Figure S13, Additional file 3: Movie S1). PCA is a commonly used data dimensionality reduction technique, which projects highdimensional data onto a threedimensional space where the data geometric structure is least distorted in a leastsquare sense [36],[40]. For ease of presentation and discussion, we annotated each sample with its corresponding PAM50 label. PAM50 is a 50gene predictor developed from microarray data that classifies breast cancer samples into intrinsic subtypes including normallike, luminal A, luminal B, HER2+ and basal [43]. We used the 144 normal breast tissue samples as the baseline. From Figure 2 and Additional file 3: Movie S1, we can see that although the normal samples were not used in feature selection, they are first connected with normallike and luminal A samples, gradually transit to luminal B samples and form a bifurcation structure leading to either basal or HER2+ samples. Basal and HER2+ tumors are known to be the most aggressive breast cancer phenotypes [44],[45]. Note that the three leading principal components account only for 28% of the total information as shown in Movie S1, and consequently some detailed structure is bound to be lost. Therefore, we generated a mutual KNN graph to visualize the data (Additional file 1: Figure S13). On the generated graph, each node represents one sample and two nodes are connected if they are among the KNNs of each other. Although a KNN graph is more sensitive to noise than PCA, one advantage of a KNN graph is that the node connectivity can help to reveal data clustering structures and thus it is commonly used as a preprocess step in spectral analysis [28]. The data visualization result appears to align well with the cancer evolution theory that posits that tumor progression is an evolutionary process akin to Darwinian evolution at the organism level, with clones as the equivalent of genetically distinct quasispecies [1][3]. If the theory is valid, a cancer progression tree would consist of a series of connected, bifurcating clusters, and this is what the PCA and KNN graphs illustrate.
Construction of a breast cancer progression model
The data visualization analysis provided an overview of data distribution, and informed the design of the novel twopronged method used to model the cancer progression process formally. Application of the spectral clustering method [28] to the METABRIC data revealed 13 distinct clusters (Figure 4A). To promote a robust clustering assignment, a resamplingbased consensus clustering analysis [32] was performed. From the generated consensus matrix shown in Figure 4B, we can clearly identify 13 diagonal blocks, which suggests that the clustering assignment is very stable. This result was further confirmed by silhouette width analysis. The clustering analysis classified 1,900 out of 1,989 (96%) samples with a positive silhouette width and yielded an average silhouette width of 0.47 (Figure 4C). Cluster 11 contains only three samples and thus was omitted in downstream analyses. The second step was to extract a principal curve to define mathematically the general trend of the data. To overcome the difficulty of extracting a selfintersected curve embedded in a highdimensional space, we applied our new principal curve method (described above). The parameter was estimated using the elbow method [39] (Additional file 2: Figure S11). Finally, we combined the clustering and principal curve results using the principal curve as a backbone to build a breast cancer progression trajectory (Figure 5). Each node on the figure represents an identified cluster and the node size is proportional to the number of samples in the corresponding cluster. Two connected nodes indicate a possible interrelationship, and the length of an edge connecting two nodes is proportional to the distance between the centers of the two nodes. We note that the overall structure of the model is consistent with the results of our data visualization analysis, suggesting that the constructed model faithfully reflects the data distribution.
Our analytical approach revealed a linear, bifurcating structure within the METABRIC dataset. There appear to be two major paths from tumor initiation to malignancy. Intuitively, both paths transition through normal and normallike tumor phenotypes towards the luminal subtypes. The linear path continues through luminal A and luminal B phenotypes but then bifurcating paths lead from luminal B to either the basal or HER2+ phenotypes (Figure 5A). No evidence for a significant interrelationship between basal and HER2+ phenotypes was revealed. Beyond the links denoting the major two pathways, we also observe a number of minor side branches. In one case, cluster 9 (normallike enriched) has a path going directly to cluster 10 (basal enriched) (Figure 5A). Although a small cluster, this suggests that shortcuts to the most malignant phenotypes are possible. Other minor branches emanate from the luminal A node. These represent subdivisions of the major cluster 6 and are in line with the finding that the luminal subtypes are genetically heterogeneous and may have multiple subtypes within them [46],[47]. It is tempting to speculate what these minor clusters represent biologically given that the paths appear to be dead ends in an evolutionary context. Perhaps, they represent tumor subtypes that remain dormant or are very slow growing, but this requires further investigation experimentally.
Survival data analysis
To examine the relationship between clinical outcome and the groups on the two major paths to malignancy, we used the KaplanMeier method [48] to plot overall survival (OS). Figure 6 illustrates a clear trend of worsening survival function associated with progression along the major trajectory through luminal types to basal or HER2+ tumors (cluster 8 through cluster 6, and cluster 1 to either cluster 3 or 5). As would be expected, each cluster, or node, on this linear path generally had a worse OS index than the preceding cluster. Interestingly, cluster 9, located at the start of the linear path between normal samples (no OS data) and the first luminaltype cluster (cluster 8), has a worse OS function than downstream cluster 8. Similar associations with outcome have been reported for this group in other studies [49]. Cluster 9 was classified as normallike by PAM50 labeling, and there has been conjecture about whether this is an artifact of contamination by high levels of normal tissue in this early stage tumor [50]. The position of the cluster on the progression model may support that notion. A thorough histological investigation of this class of tumors would be needed to resolve this issue. A more plausible explanation is that cluster 9 is connected with cluster 10, which was classed as basal and had a poor OS as shown in Figure 6. This means a subset of tumors in cluster 9 can bypass luminal intermediates and progress to either cluster 10 or 8 directly. Another pattern to note from the KaplanMeier plot is the OS data for cluster 4 (Figure 6). This cluster is located at the bifurcation point on the linear model. The OS function for cluster 4 is similar to that for the basal and HER2+ tumors early on, and continues to mimic HER2+ throughout the survival analysis. This implies that pivotal gene activities associated with outcome are acquired at this stage prior to final commitment to a basal or a HER2+ phenotype.
It is worth remembering here that the PAM50 labels were only added to the model after construction to aid in visualization and to help to put the model into context by referral to previous breast cancer classification systems [43]. The analytical approach used in this study identified 13 clusters and revealed associations between them based on statistical significance without any prelabeling process. Thus, the labels are somewhat misleading and, as shown by the continued subdivision of PAM50 subtypes [46],[47],[51], do not represent the full complexity of the data structures within tumor tissue molecular profiles. PAM50 labeling works well for the HER2+ and the basal types (clusters 3 and 5 respectively) where the majority of samples in the cluster fit that classification system, but even in these clusters there are admixtures of what would be labeled as basal or HER2+ samples. For the sidebranch clusters, the PAM50 admixtures are more pronounced (Additional file 1: Figure S14), nonetheless, the colorcoding to some extent aids the visualization of the two major pathways to malignancy in the progression model. If confirmed in independent cohorts, identification of the genetic changes associated with the interconnected clusters revealed in this study may facilitate the refinement of disease classification systems.
As there is currently no established breast cancer progression model for direct comparison, it is important to consider ways to demonstrate the validity of the constructed model. The following sections describe a series of interrogations that provide support for the proposed model and show the utility of such a model for testing and generating hypotheses and providing novel insights. The model revealed two major progression paths; these will be referred to as NB (normal through luminal to basal phenotype), and NH (normal through luminal to HER2+ phenotype) in downstream analyses.
Comparison with conceptual cancer progression models
There have been two major conceptual models proposed regarding the origins of breast cancer subtypes and associated biological mechanisms (see Figure 6 in [47]). One model proposes a distinctpath scenario where each subtype follows a path of initiation and progression independently of the others. The alternative is a linear evolution model, which proposes that tumors gradually evolve from normal cells to malignant states through the accumulation of genetic alterations. While both models embrace the notion of cancer evolution, an important implication from the first model is that the subtypes are considered as different diseases, and the alternative theory proposes that subtypes are different stages of the same disease. Clarifying this issue could have a profound impact as research strategies used in the two scenarios could be very different. The bifurcation structure revealed in our model supports the linear evolution model as a representation of the breast cancer progression process. We should emphasize that our method is a generic approach without making any model assumption on data. If the four major subtypes evolve directly from normal cells, as proposed by the discrete evolution model, in a population study with a large number of samples, we should be able to detect four independent paths connecting normal samples with the four subtypes, but this was not the case (Figures 2, 5 and 7). Our result suggests that basal subtypes are derived from the luminal subtypes, an idea that has been recently suggested through experimentation [52]. The idea that HER2+ phenotypes are derived from luminal B tumors also makes biological sense. Through association of CNA data and putative driver gene expression (data not shown), we found that the copy numbers of the genes involved in the HER2 signaling pathway are significantly amplified in HER2+ samples relative to luminal B samples, suggesting that the HER2+ phenotype develops from luminal B through gene copy number variation, and that this event is distinct from progression to basal phenotypes. This result echoes recent studies that demonstrated that cancer subtypes are not hardwired, and genotypes and phenotypes can shift over time [1].
Mapping of tumor grades onto the progression model
We next investigated how grades would be mapped to the two major branches of the constructed progression model. Histological tumor grade is a measure of the extent of abnormal tumor cell morphology relative to normal cells [53]. Since the determination of histological intermediategrade tumors is somewhat subjective, a method to derive a molecular grade index has been developed [54], and the information required to derive this index was available in the METABRIC dataset. We found that both the molecular and histological grades were highly correlated with both of the major progression paths. Figure 5B plots molecular grade against the samples on the model’s NB progression path and reveals that there is a clear increase in grade along the path to the most malignant basal phenotype samples. Likewise, for the NH path (Figure 5B) a clear trend in molecular grade towards the malignant HER2+ samples was observed. Statistical significance was determined by Spearman’s test. Briefly, each sample was projected onto the specific progression path (the projection of a sample is defined as a point on the path that is the closest to the sample; see Additional file 2: Figures S8 and S10), and then a Spearman correlation analysis of histologic/molecular grade was performed against the NB/NH paths. Strong correlation was observed for both molecular grade (NB: r=0.89, P=0; NH: r=0.84, P=0) and histological grade (NB: r=0.61, P=0; NH: r=0.53, P=0; see Figure 5D).
The data support the validity of the progression model in that statistically significant correlations were identified, but also because it aligns with established grade associations. Nearly all basal and HER2+ tumor phenotypes are aggressive grade3 cancers and the majority of luminal A tumors are low grade1 cancers [55]. In itself, this is difficult to explain within a discrete disease evolution model as it implies that basal and HER2+ tumors are highgrade cancers from the outset. The correlation aligns well with a linear evolution model. Interestingly, the mapping of the grades onto the model also supports a malignancyassociated transition from luminal A to luminal B phenotypes. Luminal B tumors are routinely graded higher histologically than luminal A tumors, and grade3 rates (and outcomes) approach those seen in the more aggressive HER2+ and basal phenotypes [47],[55]. It has been proposed that this phenomenon reflects a yet more complex inventory of subtypes within the luminal B phenotype [47], but it could also be explained by a progressive transition from luminal A to luminal B and then on to the aggressive HER2+ or basal tumor phenotypes.
Model validation using an independent breast cancer dataset
To investigate whether a similar model would be derived from an independent dataset, we performed a computational analysis on data obtained from The Cancer Genome Atlas (TCGA) breast cancer project [8]. As such, we have used data from the two most comprehensive breast cancer projects conducted to date. In the TCGA project, a total of 507 tumor tissues and 17 normal tissues were subjected to gene expression and copy number profiling. It is often difficult to perform direct comparisons between large projects because the data sources are not entirely compatible. In this case, the TCGA study employed different microarray platforms, and the clinical followup was markedly shorter (median overall followup was 17 months vs 98 months for the METABRIC data, and there were only 90 overall survival events), so feature selection on this dataset would be underpowered. Instead, we mapped the genes selected from the METABRIC data analysis back to the TCGA data to perform the clustering and principal curve analyses and model construction. A total of 775 of the 989 genes selected from the METABRIC dataset were also found in the TCGA data. On application of the same twopronged analytical protocol, nine stable clusters were identified (Additional file 1: Figures S15 and S16). The difference in the numbers of clusters found in the two studies can be attributed to various factors including the different sample sizes, the different microarray platforms used, and the partially overlapped gene sets. Despite those differences, the major structures of the progression model constructed using the TCGA data (i.e., the bifurcation structure and the order that the clusters are connected) were almost identical to those constructed using the METABRIC data (Figure 7). As with the METABRIC data model, luminal A clusters represented the largest group in the cohort, there were four luminal A type and two luminal B type clusters in series, and there were some deadend side branches emanating from the luminal phenotypes. Because only 8 out of 507 tumor samples were classified as normallike, no distinct cluster was identified on the TCGA progression tree. Survival curves for the clusters in the TCGA model could not be constructed due to the lack of followup data. Consistent with the results obtained from the METABRIC progression model, the progression model constructed using the TCGA data was also significantly correlated with the molecular grade index (Spearman’s test, NB: r=0.85, P=0; NH: r=0.81, P=0). Of note, the magnitudes of the correlation obtained for both METABRIC and TCGA models were comparable (see Figure 5D). No histological grade data was available for the TCGA data.
Mapping genetic alterations onto the progression models
A bonus of the TCGA project was the availability of the overall and nonsilent mutation rates for each sample. This information, together with the frequency of CNAs, was used to test model validity. Here, the CNA frequency of a sample is defined as the sum of the magnitude of CNAs including amplification and deletion of all genes in the sample. It is widely accepted that cancer development is due to the accumulation of genetic alterations in somatic cells [1][4]. Among them, somatic point mutations and CNAs play a central role in tumorigenesis [56],[57]. If our constructed breast cancer progression models are valid, we would expect somatic mutation rates and CNA frequencies to be positively correlated with the modeled progression trajectory. To investigate this, we mapped the overall mutation rate and CNA frequency of each sample back to the two major progression branches revealed by the TCGA model and found that this is indeed the case for both somatic mutation rate (NB: r=0.46, P=1.1 × 10^{12}; NH: r=0.50, P=1.2 × 10^{13}; see Figure 7C) and CNA frequency (NB: r=0.54, P=0; NH: r=0.56, P=0; see Figure 7D). A similar result was observed for nonsilent mutation rates (Figure 5D). We also found that patient age had no significant correlation with the progression model (Spearman’s test, NB: r=0.07, P=0.32; NH: r=0.06, P=0.31; see Figure 5D)^{b}. This finding means that the significant correlation between genetic alteration rates and the two major progression branches are statistically independent of patient ages. This is in line with recent findings that showed that although tumors originating from many selfrenewing tissues have increasing mutation rates with age, no such correlation has been found in breast or ovarian cancer [58]. Consistent with the results obtained from the TCGA model, the progression model constructed using the METABRIC data was also significantly correlated with the CNA frequency (Spearman’s test, NB: r=0.53, P=0; NH: r=0.50, P=0; see Figure 5C,D). No somatic mutation data were available for the METABRIC data.
Genome instability is generally referred to as an enabling characteristic of cancer progression [5]. The significant correlation of both somatic mutation rate and CNA frequency in the progression models built from two independent datasets provides strong evidence supporting the validity of the proposed model.
Mapping of key genes onto breast cancer progression paths
Next, we investigated the change in gene expression of some key genes during breast cancer progression by mapping them onto the model derived from the more comprehensive METABRIC dataset. Our initial interrogation mapped the expression of seven key genes (AURKA, PLAU, STAT1, VEGF, CASP3, ESR1 and ERBB2), which represent described hallmarks of cancer, namely proliferation, tumor invasion and metastasis, immune response, angiogenesis, apoptosis, and estrogen (ER) and HER2 signaling, respectively [59]. The resulting plots revealed the change of expression of these genes along both linear paths, normal to HER2+ (NH, Additional file 1: Figure S17) and normal to basal (NB, Additional file 1: Figure S18). By normalizing the expression levels, we were able to visualize the changes on a single overlay plot (Figure 8). The curves were generated using the polynomial curve fitting method and the degree parameter was estimated through tenfold crossvalidation [40]. Once again, we label the axis with PAM50 subtype labels to indicate an approximation of tumor subtypes for ease of discussion.
As would be expected, the ERBB2 gene is only highly expressed in the HER2+ cluster on the NH path (Figure 8, left), suggesting ERBB2 amplification is a late onset event in tumor progression. As also expected, the estrogen receptor ESR1 is low in normal phenotypes, increased in luminal phenotypes that are primarily classified as ER+, and lost in the most malignant basal and HER2+ phenotypes. On the NB path plot (Figure 8, right), we see that basal phenotypes are ER/HER2. Most triplenegative tumors (ER /HER2 /PR ), which carry a particularly poor prognosis, have a basal phenotype [60]. VEGF expression increases from the normal to luminal phenotypes, but then plateaus across luminal phenotypes until the transition to malignant HER2+ or basal phases. This may suggest that angiogenesis is a limiting factor for progression from luminal phenotypes, perhaps indicating a potential point of efficacy for antiangiogenic therapy. As an indicator of tumor invasion, higher PLAU expression is expected at the malignant disease stages, and this is observed for both progression paths, but the correlation is less pronounced than for other indicators. The plots reveal that increasing AURKA expression is very strongly correlated with progression to malignancy. Intuitively, we might expect that proliferation will increase in an overall sense as cancer develops, but even malignant tumors are relatively slowgrowing from a cellular mass point of view, and so tumors have to overcome opposing events to achieve a net growth. This is supported by the fact that the expression of CASP3, an indicator of apoptosis [61], also increases in parallel with AURKA as tumors progress. AURKA has been reported to be associated with outcome as the key gene in a proliferation module [59] and as part of a threegene molecular signature for subtype classification of tumors [62]. Those studies provided support for AURKA gene expression as a classifier or a biomarker for prognosis using associations with particular subtypes, but the positive correlation across NH and NB paths revealed by plotting the progression model provides further support for AURKA monitoring as a powerful biomarker of tumor status and prognosis. Likewise, the strong correlation of expression of STAT1 suggests that this factor could also act as a marker of progression status and outcome. The transcription factor STAT1 has been thought of as an oncosuppressive factor, acting through the stimulation of antiproliferative and proapoptotic genes in tumor cells, but recent evidence supports a multivalent role for this factor in advancing cancer. STAT1 expression is upregulated in a number of latestage cancers, including breast cancer [63][65], and high STAT1 levels have been correlated with poor survival, presence of metastases and with chemo and radiotherapy resistance in breast cancer patients [66],[67]. Potential functions of STAT1 in aggressive cancers include the maintenance of prosurvival genes [68] and the regulation of the local immune response [65]. There are conflicting reports of STAT1 expression levels in breast cancer subtypes [59], but in the large METABRIC dataset, STAT1 expression was significantly increased in the malignant phenotypes.
We also explored the potential association of reported cancer driver genes with breast cancer progression using the same model interrogation approach. The 125 driver genes selected have been defined through largescale mutational analyses [69], but changes in their expression level may also enhance or inhibit some aspect of tumor progression. A total of 31 genes were found to have significant changes in expression across either path of the progression model (Additional file 1: Figures S19 and S20 and Additional file 4: Table S1). From the individual plots, we see that a few genes have a distinctive association with progression. Of note are the opposing curves of EGFR and NOTCH1 versus GATA3 and RET. The former pair have Ushaped curves depicting higher expression in normal and malignant phenotypes, whereas the latter pair have pronounced bellshaped curves that mirror the EGFR/NOTCH1 expression profile. These genes may have a coordinate expression relationship, being regulated by each other, or perhaps through the estrogen receptor, which has a similar profile to GATA3/RET across the progression model, and such networks are beginning to be elucidated [70]. Other notable plots include EZH2, which is expressed at progressively higher levels on both paths to malignancy, and SMAD4 and KIT, which similarly decrease during transition from normal to malignant states. The only driver gene to have a distinct spike in expression in the malignant phenotypes was CDKN2A. This gene encodes the p16INK4a tumor suppressor protein, which regulates the cell cycle, and so the marked increase in expression in both basal and HER2+ phenotypes is intriguing. Interestingly, recent studies have shown that in breast cancer, high expression is indicative of a more undifferentiated phenotype [71], and elevated CDKN2A has been proposed as a marker of poor prognosis [72]. Our progression model supports these recent reports. While the driver gene analyses do not address the mutational status of the specific genes, the level of their expression is of interest because it will impact the potential effect that the driver mutations may have on tumor phenotype.
Discussion
Overall, our findings support the idea that cancers acquire the properties required for progression towards malignancy from an accumulation of genetic aberrations, a phenomenon that is consistent with clonal evolution theory. This process occurs over extended time periods, measured in years, and so accurate mapping is logistically very challenging. From cancer patients, we typically only have data from a single time point through sampling at the time of surgery, and this restriction is unlikely to be overcome for the majority of solid tumors. While it may subsequently be possible to reconstruct the development of malignancy through a longitudinal study of a few select patients, to overcome issues of heterogeneity, a large number of tissue samples is possibly needed first to construct a theoretical model of progression. This can only be achieved through the analysis of available static samples. A computational approach that can overcome the sampling limitation and thereby enable the leveraging of accumulating data and the vast tissue archive represents a major advance with respect to the application of bioinformatics methodology to the study of progressive human diseases.
There are many potential applications for the proposed approach. One of them is to identify genetic alterations that drive tumor progression, which is one of the central goals of some recent largescale cancer studies [7],[8],[73][76]. The primary analytical strategy used today is prevalencebased approaches that search through a large number of tumor samples for genetic changes that occur with a higher frequency than would be expected by chance alone [73],[74]. While the associations between genetic events and disease can be revealed by existing methods in an overall sense and can even be associated with a putative tumor subtype, the placement of molecular events onto a cancer progression map through a progressionmodelbased approach provides a way to put these molecular events into the context of a dynamic disease process and actually enables us to determine what changes are pivotal to transition from one phenotype to another. If confirmed, such a model could provide a foundation for the visualization of key progressive molecular events and facilitate the identification of pivotal driver genes and pathways, the most relevant and robust biomarker signatures, and potential points of susceptibility for therapeutic intervention. Moreover, interrogation of the model will allow researchers to test novel hypotheses in silico and thus help to prioritize resources for more focused and detailed investigations experimentally.
The presented study has demonstrated the possibility of using static sample data to study disease dynamics, which lays a foundation for future studies to incorporate all currently available molecular data (e.g., mRNA, microRNA, copy number, methylation and protein expression data) to construct highresolution cancer progression trees. Our ongoing work will thus focus on addressing the challenges associated with processing massive, highdimensional data of different types to build models of increasing resolution. This in turn will further facilitate the elucidation of clinically and biologically pertinent issues. The described computational methods can also be further improved accordingly. For example, in the proposed EMbased principal curvefitting algorithm, we assumed that the data was corrupted by noise with an equal variance throughout the genomic space. This heuristic approach worked well, but to build higher resolution models, we are working to develop an algorithm using datadriven variance estimation. In this study, we used a twopronged approach consisting of spectral clustering and principal curve to model a cancer progression path. Incorporation of other data dimensionality techniques (e.g., selforganizing map [77] and Isomap [78]) may also aid the production of a lowdimensional representation of the input space of cancer samples. We expect that the results would be similar, but an intensive computation study is needed to assess the utility of other techniques for cancer progression modeling.
Conclusions
We describe the derivation of a novel computational approach for mapping the development of cancer towards malignancy. Through application to two independent, largescale breast cancer datasets, we have shown that the proposed method can reconstruct tumor progression through the analysis of static samples and thereby identify genetic events associated with pivotal shifts in phenotype. This new set of tools will enable the construction of highresolution progression trees for cancers and other diseases for which longitudinal data sampling is ethically or logistically not possible. Refinement of progression trees will facilitate the identification of molecular drivers of disease progression and the derivation of robust biomarker signatures for patient evaluation and management.
Endnotes
^{a} The data is available in the European GenomePhenome Archive with the accession number [EGAS00000000083].
^{b} Based on a rule of thumb used in the statistical community, two variables with a correlation larger than 0.2 and smaller than 0.2 are considered as having no or negligible relationship.
Authors’ contributions
YS and SG designed the study. YS and JY performed the data analysis. YS, JY, NJN and SG wrote the manuscript. All authors read and approved the final manuscript.
Additional files
Abbreviations
 EM:

expectation maximization
 ER:

estrogen receptor
 KNN:

K nearest neighbor
 LN:

lymph node
 miRNA:

microRNA
 ML:

maximum likelihood
 NB:

normal through luminal to basal phenotype
 NH:

normal through luminal to HER2+ phenotype
 OS:

overall survival
 PCA:

principal component analysis
 TCGA:

The Cancer Genome Atlas
References
 1.
Greaves M, Maley CC: Clonal evolution in cancer . Nature. 2012, 481: 306313. 10.1038/nature10762.
 2.
Yates LR, Campbell PJ: Evolution of the cancer genome . Nat Rev Genet. 2012, 13: 795806. 10.1038/nrg3317.
 3.
Podlaha O, Riester M, De S, Michor F: Evolution of the cancer genome . Trends Genet. 2012, 28: 155163. 10.1016/j.tig.2012.01.003.
 4.
Stratton MR, Campbell PJ, Futreal PA: The cancer genome . Nature. 2009, 458: 719724. 10.1038/nature07943.
 5.
Hanahan D, Weinberg RA: Hallmarks of cancer: the next generation . Cell. 2011, 144: 646674. 10.1016/j.cell.2011.02.013.
 6.
BarJoseph Z, Gitter A, Simon I: Studying and modelling dynamic biological processes using timeseries gene expression data . Nat Rev Genet. 2012, 13: 552564. 10.1038/nrg3244.
 7.
Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, Graf S, Ha G, Haffari G, Bashashati A, Russell R, McKinney S, Langerod A, Green A, Provenzano E, Wishart G, Pinder S, Watson P, Markowetz F, Murphy L, Ellis I, Purushotham A, BorresenDale AL, Brenton JD, Tavare S, Caldas C, et al: The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups . Nature. 2012, 486: 346352.
 8.
The Cancer Genome Atlas Network: Comprehensive molecular portraits of human breast tumours . Nature. 2012, 490: 6170. 10.1038/nature11412.
 9.
Siegel R, Ma J, Zou Z, Jemal A: Cancer statistics, 2014 . CA: A Cancer J Clin. 2014, 64: 929.
 10.
van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer . Nature. 2002, 415: 530536. 10.1038/415530a.
 11.
Sun Y, Goodison S, Li J, Liu L, Farmerie W: Improved breast cancer prognosis through the combination of clinical and genetic markers . Bioinformatics. 2007, 23: 3037. 10.1093/bioinformatics/btl543.
 12.
Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijervan Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA: Geneexpression profiles to predict distant metastasis of lymphnodenegative primary breast cancer . Lancet. 2005, 365: 671679. 10.1016/S01406736(05)179471.
 13.
Porter PL, Malone KE, Heagerty PJ, Alexander GM, Gatti LA, Firpo EJ, Daling JR, Roberts JM: Expression of cellcycle regulators p27 ^{kip1}and cyclin E, alone and in combination, correlate with survival in young breast cancer patients . Nat Med. 1997, 3: 222225. 10.1038/nm0297222.
 14.
Slamon DJ, Clark GM, Wong SG, Levin WJ, Ullrich A, McGuire WL: Human breast cancer: correlation of relapse and survival with amplification of the HER2/neu oncogene . Science. 1987, 235: 177182. 10.1126/science.3798106.
 15.
Sun Y, Goodison S: Optimizing molecular signatures for predicting prostate cancer recurrence . Prostate. 2009, 69: 11191127. 10.1002/pros.20961.
 16.
Goodison S, Sun Y, Urquidi V: Derivation of cancer diagnostic and prognostic signatures from gene expression data . Bioanalysis. 2010, 2: 855862. 10.4155/bio.10.35.
 17.
Tibshirani R: Regression shrinkage and selection via the lasso . J R Stat Soc Ser B (Methodological). 1996, 58: 267288.
 18.
Qiu P, Gentles AJ, Plevritis SK: Discovering biological progression underlying microarray samples . PLoS Comput Biol. 2011, 7: 100112310.1371/journal.pcbi.1001123.
 19.
Sun Y, Todorovic S, Goodison S: Locallearningbased feature selection for highdimensional data analysis . IEEE Trans Pattern Anal Mach Intell. 2010, 32: 16101626. 10.1109/TPAMI.2009.190.
 20.
Sun Y: Iterative RELIEF for feature weighting: algorithms, theories, and applications . IEEE Trans Pattern Anal Mach Intell. 2007, 29: 10351051. 10.1109/TPAMI.2007.1093.
 21.
Kira K, Rendell LA: A practical approach to feature selection. In Proceedings of the Ninth International Workshop on Machine Learning. San Francisco, CA; 1992:249256.
 22.
Crammer K, GiladBachrach R, Navot A, Tishby N: Margin analysis of the LVQ algorithm. In Proceedings of Advances in Neural Information Processing Systems. Volume 2; 2002:462469.
 23.
Vapnik V: The Nature of Statistical Learning Theory, New York: Springer; 2000.
 24.
Schapire RE, Freund Y, Bartlett P, Lee WS: Boosting the margin: a new explanation for the effectiveness of voting methods . Ann Stat. 1998, 26: 16511686. 10.1214/aos/1024691352.
 25.
Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm . J R Stat Soc Ser B (Methodological). 1977, 39: 138.
 26.
Cai Y, Sun Y, Cheng Y, Li J, Goodison S: Fast implementation of ℓ _{1}regularized learning algorithms using gradient descent methods. In 10th SIAM International Conference on Data Mining; 2010:862871.,
 27.
Kress R: Numerical Analysis, New York: Springer; 1998.
 28.
Luxburg U: A tutorial on spectral clustering . Stat Comput. 2007, 17: 395416. 10.1007/s112220079033z.
 29.
Manning CD, Raghavan P, Schütze H: Introduction to Information Retrieval, Cambridge: Cambridge University Press; 2008.
 30.
ZelnikManor L, Perona P: Selftuning spectral clustering. In Proceedings of Advances in Neural Information Processing Systems. Volume 17; 2004:16011608.
 31.
Ng AY, Jordan MI, Weiss Y: On spectral clustering: analysis and an algorithm. In Proceedings of Advances in Neural Information Processing Systems. Volume 14; 2001:849856.
 32.
Monti S, Tamayo P, Mesirov J, Golub T: Consensus clustering: a resamplingbased method for class discovery and visualization of gene expression microarray data . Mach Learn. 2003, 52: 91118. 10.1023/A:1023949509487.
 33.
Hastie T, Stuetzle W: Principal curves . J Am Stat Assoc. 1989, 84: 502516. 10.1080/01621459.1989.10478797.
 34.
Tibshirani R: Principal curves revisited . Stat Comput. 1992, 2: 183190. 10.1007/BF01889678.
 35.
Ozertem U, Erdogmus D: Locally defined principal curves and surfaces . J Mach Learn Res. 2011, 12: 12491286.
 36.
Bishop CM: Pattern Recognition and Machine Learning, New York: Springer; 2006.
 37.
Duchamp T, Stuetzle W: Extremal properties of principal curves in the plane . Ann Stat. 1996, 24: 15111520. 10.1214/aos/1032298280.
 38.
Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic . J R Stat Soc Ser B (Stat Methodol). 2001, 63: 411423. 10.1111/14679868.00293.
 39.
Sugar CA: Techniques for clustering and classification with applications to medical problems. PhD thesis. Stanford University, 1998.
 40.
Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning, New York: Springer; 2009.
 41.
Kirkwood BR, Sterne JA: Medical Statistics, Massachusetts: Blackwell Science; 2003.
 42.
Perou CM, Parker JS, Prat A, Ellis MJ, Bernard PS: Clinical implementation of the intrinsic subtypes of breast cancer . Lancet Oncol. 2010, 11: 718719. 10.1016/S14702045(10)701765.
 43.
Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z, Quackenbush JF, Stijleman IJ, Palazzo J, Marron JS, Nobel AB, Mardis E, Nielsen TO, Ellis MJ, Perou CM, Bernard PS: Supervised risk predictor of breast cancer based on intrinsic subtypes . J Clin Oncol. 2009, 27: 11601167. 10.1200/JCO.2008.18.1370.
 44.
Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, Van De Rijn M, Jeffrey SS, Thorsen T, Quist H, Matese JC, Brown PO, Botstein D, Lønning PE, BørresenDale AL: Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications . Proc Nat Acad Sci. 2001, 98: 1086910874. 10.1073/pnas.191367098.
 45.
Sorlie T, Tibshirani R, Parker J, Hastie T, Marron J, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, Demeter J, Perou CM, Lønning PE, Brown PO, BørresenDale AL, Botstein D: Repeated observation of breast tumor subtypes in independent gene expression data sets . Proc Nat Acad Sci. 2003, 100: 84188423. 10.1073/pnas.0932692100.
 46.
Ciriello G, Sinha R, Hoadley KA, Jacobsen AS, Reva B, Perou CM, Sander C, Schultz N: The molecular diversity of luminal A breast tumors . Breast Cancer Res Treat. 2013, 141: 409420. 10.1007/s1054901326993.
 47.
Creighton CJ: The molecular profile of luminal B breast cancer . Biologics: Targets Therapy. 2012, 6: 289
 48.
Kaplan EL, Meier P: Nonparametric estimation from incomplete observations . J Am Stat Assoc. 1958, 53: 457481. 10.1080/01621459.1958.10501452.
 49.
NofechMozes S, Trudeau M, Kahn HK, Dent R, Rawlinson E, Sun P, Narod SA, Hanna WM: Patterns of recurrence in the basal and nonbasal subtypes of triplenegative breast cancers . Breast Cancer Res Treat. 2009, 118: 131137. 10.1007/s1054900802958.
 50.
Xu H, Eirew P, Mullaly SC, Aparicio S: The omics of triplenegative breast cancers . Clin Chem. 2014, 60: 122133. 10.1373/clinchem.2013.207167.
 51.
Guiu S, Michiels S, Andre F, Cortes J, Denkert C, Di Leo A, Hennessy B, Sorlie T, Sotiriou C, Turner N, Van de Vijver M, Viale G, Loi S, ReisFilho JS: Molecular subclasses of breast cancer: how do we define them? The IMPAKT 2012 working group statement . Ann Oncol. 2012, 23: 29973006. 10.1093/annonc/mds586.
 52.
Molyneux G, Geyer FC, Magnay FA, McCarthy A, Kendrick H, Natrajan R, MacKay A, Grigoriadis A, Tutt A, Ashworth A, ReisFilho JS, Smalley MJ: BRCA1 basallike breast cancers originate from luminal epithelial progenitors and not from basal stem cells . Cell Stem Cell. 2010, 7: 403417. 10.1016/j.stem.2010.07.010.
 53.
Rakha EA, ElSayed ME, Lee AH, Elston CW, Grainge MJ, Hodi Z, Blamey RW, Ellis IO: Prognostic significance of Nottingham histologic grade in invasive breast carcinoma . J Clin Oncol. 2008, 26: 31533158. 10.1200/JCO.2007.15.5986.
 54.
Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, HaibeKains B, Desmedt C, Larsimont D, Cardoso F, Peterse H, Nuyten D, Buyse M, Van de Vijver MJ, Bergh J, Piccart M, Delorenzi M: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis . J Nat Cancer Inst. 2006, 98: 262272. 10.1093/jnci/djj052.
 55.
Calza S, Hall P, Auer G, Bjöhle J, Klaar S, Kronenwett U, Liu ET, Miller L, Ploner A, Smeds J, Bergh J, Pawitan Y: Intrinsic molecular signature of breast cancer in a populationbased cohort of 412 patients . Breast Cancer Res. 2006, 8: 3410.1186/bcr1517.
 56.
Santarius T, Shipley J, Brewer D, Stratton MR, Cooper CS: A census of amplified and overexpressed human cancer genes . Nat Rev Cancer. 2010, 10: 5964. 10.1038/nrc2771.
 57.
Watson IR, Takahashi K, Futreal PA, Chin L: Emerging patterns of somatic mutations in cancer . Nat Rev Genet. 2013, 14: 703718. 10.1038/nrg3539.
 58.
Tomasetti C, Vogelstein B, Parmigiani G: Half or more of the somatic mutations in cancers of selfrenewing tissues originate prior to tumor initiation . Proc Nat Acad Sci. 2013, 110: 19992004. 10.1073/pnas.1221068110.
 59.
Desmedt C, HaibeKains B, Wirapati P, Buyse M, Larsimont D, Bontempi G, Delorenzi M, Piccart M, Sotiriou C: Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes . Clin Cancer Res. 2008, 14: 51585165. 10.1158/10780432.CCR074756.
 60.
Peddi PF, Ellis MJ, Ma C: Molecular basis of triple negative breast cancer and implications for therapy . Int J Breast Cancer. 2012, 2012: 21718510.1155/2012/217185.
 61.
Simpson KL, Cawthorne C, Zhou C, Hodgkinson CL, Walker MJ, Trapani F, Kadirvel M, Brown G, Dawson M, MacFarlane M, Williams KJ, Whetton AD, Dive C: A caspase3 `deathswitch’ in colorectal cancer cells for induced and synchronous tumor apoptosis in vitro and in vivo facilitates the development of minimally invasive cell death biomarkers . Cell Death Dis. 2013, 4: 61310.1038/cddis.2013.137.
 62.
HaibeKains B, Desmedt C, Loi S, Culhane AC, Bontempi G, Quackenbush J, Sotiriou C: A threegene model to robustly identify breast cancer molecular subtypes . J Nat Cancer Inst. 2012, 104: 311325. 10.1093/jnci/djr545.
 63.
Manavi M, Hudelist G, FinkRetter A, GschwandtlerKaulich D, Pischinger K, Czerwenka K: Gene profiling in papcell smears of highrisk human papillomaviruspositive squamous cervical carcinoma . Gynecol Oncol. 2007, 105: 418426. 10.1016/j.ygyno.2006.12.032.
 64.
Schultz J, Koczan D, Schmitz U, Ibrahim SM, Pilch D, Landsberg J, Kunz M: Tumorpromoting role of signal transducer and activator of transcription (Stat) 1 in latestage melanoma growth . Clin Exp Metastasis. 2010, 27: 133140. 10.1007/s1058501093107.
 65.
Hix LM, Karavitis J, Khan MW, Shi YH, Khazaie K, Zhang M: Tumor STAT1 transcription factor activity enhances breast tumor growth and immune suppression mediated by myeloidderived suppressor cells . J Biol Chem. 2013, 288: 1167611688. 10.1074/jbc.M112.441402.
 66.
Weichselbaum RR, Ishwaran H, Yoon T, Nuyten DS, Baker SW, Khodarev N, Su AW, Shaikh AY, Roach P, Kreike B, Roizman B, Bergh J, Pawitan Y, van de Vijver MJ, Minn AJ: An interferonrelated gene signature for DNA damage resistance is a predictive marker for chemotherapy and radiation for breast cancer . Proc Nat Acad Sci. 2008, 105: 1849018495. 10.1073/pnas.0809242105.
 67.
Greenwood C, Metodieva G, AlJanabi K, Lausen B, Alldridge L, Leng L, Bucala R, Fernandez N, Metodiev MV: Stat1 and CD74 overexpression is codependent and linked to increased invasion and lymph node metastasis in triplenegative breast cancer . J Proteomics. 2012, 75: 30313040. 10.1016/j.jprot.2011.11.033.
 68.
AdachKilon A, SwiatekMachado K, Kaminska B, Dabrowski M: Signal transducer and activator of transcription 1 (Stat1) maintains basal mRNA expression of prosurvival stat3target genes in glioma C6 cells . J Cell Biochem. 2011, 112: 36853694. 10.1002/jcb.23305.
 69.
Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Kinzler KW: Cancer genome landscapes . Science. 2013, 339: 15461558. 10.1126/science.1235122.
 70.
Kittler R, Zhou J, Hua S, Ma L, Liu Y, Pendleton E, Cheng C, Gerstein M, White KP: A comprehensive nuclear receptor network for breast cancer cells . Cell Reports. 2013, 3: 538551. 10.1016/j.celrep.2013.01.004.
 71.
MildeLangosch K, Bamberger AM, Rieck G, Kelp B, Löning T: Overexpression of the p16 cell cycle inhibitor in breast cancer is associated with a more malignant phenotype . Breast Cancer Res Treat. 2001, 67: 6170. 10.1023/A:1010623308275.
 72.
Hui R, Macmillan RD, Kenny FS, Musgrove EA, Blamey RW, Nicholson RI, Robertson JF, Sutherland RL: INK4a gene expression and methylation in primary breast cancer: overexpression of p16INK4a messenger RNA is a marker of poor prognosis . Clin Cancer Res. 2000, 6: 27772787.
 73.
Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M, McHenry KT, Pinchback RM, Ligon AH, Cho YJ, Haery L, Greulich H, Reich M, Winckler W, Lawrence MS, Weir BA, Tanaka KE, Chiang DY, Bass AJ, Loo A, Hoffman C, Prensner J, Liefeld T, Gao Q, Yecies D, Signoretti S, et al: The landscape of somatic copynumber alteration across human cancers . Nature. 2010, 463: 899905. 10.1038/nature08822.
 74.
Stratton MR: Exploring the genomes of cancer cells: progress and promise . Science. 2011, 331: 15531558. 10.1126/science.1204040.
 75.
Wood LD, Parsons DW, Jones S, Lin J, Sjöblom T, Leary RJ, Shen D, Boca SM, Barber T, Ptak J, Silliman N, Szabo S, Dezso Z, Ustyanksky V, Nikolskaya T, Nikolsky Y, Karchin R, Wilson PA, Kaminker JS, Zhang Z, Croshaw R, Willis J, Dawson D, Shipitsin M, Willson JKV, Sukumar S, Polyak K, Park BH, Pethiyagoda CL, Pant PVK, et al: The genomic landscapes of human breast and colorectal cancers . Science. 2007, 318: 11081113. 10.1126/science.1145720.
 76.
Raphael BJ, Dobson JR, Oesper L, Vandin F: Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine . Genome Med. 2014, 6: 510.1186/gm524.
 77.
Kohonen T, Honkela T: Kohonen network . Scholarpedia. 2007, 2: 156810.4249/scholarpedia.1568.
 78.
Tenenbaum JB, De Silva V, Langford JC: A global geometric framework for nonlinear dimensionality reduction . Science. 2000, 290: 23192323. 10.1126/science.290.5500.2319.
Acknowledgements
Thanks to the collective efforts of the TCGA initiative of the National Institutes of Health (NIH) and equivalent European consortia, multiformat data from large numbers of tumor tissue samples are publicly available. Without these efforts, it would not be feasible to develop computational approaches that describe the dynamic aspects of disease progression. This work was supported in part by the National Science Foundation under grant number 1322212 (YS), the SUNY Research Foundation (YS), the JR Oishei Foundation (NJN) and NIH RO1 CA108597 (SG).
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Received
Accepted
Published
DOI
Keywords
 Principal Curve
 Spectral Cluster
 Copy Number Data
 Basal Phenotype
 Luminal Phenotype