IBoost: an integrative boosting approach for predicting survival time with multiple genomics platforms
Genome Biology volume 20, Article number: 52 (2019)
Abstract
We propose a statistical boosting method, termed IBoost, to integrate multiple types of high-dimensional genomics data with clinical data for predicting survival time. IBoost provides substantially higher prediction accuracy than existing methods. By applying IBoost to The Cancer Genome Atlas, we show that the integration of multiple genomics platforms with clinical variables improves the prediction of survival time over the use of clinical variables alone; gene expression values are typically more prognostic of survival time than other genomics data types; and gene modules/signatures are at least as prognostic as the collection of individual gene expression data.
Background
Prediction of disease outcomes, such as individual patient survival time, is critically important for cancer patients. Traditional prognostic models that rely solely on clinical variables, such as age and tumor stage, fail to account for the molecular heterogeneity of tumors and thus may lead to suboptimal treatment decisions [1]. To remedy this situation, many studies have incorporated gene expression data in survival prediction [2–5].
Large-scale genomics projects such as The Cancer Genome Atlas (TCGA) have generated detailed molecular data on patients with a variety of cancer types. In TCGA, six types of “omics” data have been collected on the same set of patients: DNA copy number variation, somatic mutation, mRNA expression, microRNA expression, DNA methylation, and expression of ∼ 200 proteins/phosphoproteins. The availability of multiple data types has enabled researchers to address a variety of important questions. For example, patients can be more precisely classified into molecular subtypes based on integrative clustering of multiple genomics data types or platforms [6–8]. In addition, it is possible to identify genes that are related to patient survival time by decomposing the expression of each gene into a component that is explained by the methylation level and a component that is not [9].
One unsolved issue in cancer genomics is the prognostic value of integrated genomics and clinical data versus clinical data only. Yuan et al. [10] compared models with clinical data only versus models with both clinical and genomics data on various cancer types and concluded that genomics data provide only a limited gain in survival prediction accuracy. In their analysis, however, potential differences among data types were not taken into account. For breast cancer, for instance, the combination of genomics and clinical data has been shown to improve outcome predictions [11, 12]. A major goal of the present work is to fully explore the predictive power of integrating clinical and genomics data.
A second unsolved issue is the prognostic value of individual gene expression values (∼ 25,000) versus a predefined set of gene expression signatures or “modules” (∼ 500). Gene modules have been developed for representing distinct cell types (e.g., epithelial, immune, and endothelial), specific biological processes, or activated molecular signaling pathways. They have been shown to successfully capture signaling pathway activities or cell type heterogeneity within tumors. We wish to investigate whether individual gene expression data or existing gene modules provide more accurate outcome prediction.
A third unsolved issue is the relative importance of different types of genomics data in outcome prediction. Different data types are collected at different costs and also with widely varying feature spaces. Naturally, not all data types are equally important in outcome prediction. We aim to determine which data types may be omitted from analysis without a significant reduction in prediction accuracy.
An overarching methodological challenge in addressing the aforementioned issues is the identification of genomic variables predictive of survival time when the number of variables is much larger than the sample size. Penalized regression methods, such as least absolute shrinkage and selection operator (LASSO) [13] and elastic net [14], are commonly used to identify important genomic variables. When variables are highly correlated, elastic net tends to have better performance in prediction than LASSO [14]. However, both LASSO and elastic net are generic variable selection procedures that do not distinguish different types of data and thus tend to select more variables from the data types with larger numbers of variables. Because different data types capture different biological structures, both large and small data types may carry important signals. Methods that treat all variables equally may not be able to pick out independent signals from small data types. In addition, LASSO and elastic net impose the same penalty on all regression parameters, which may be overly restrictive because the number of variables and the signal strength vary drastically across data types.
Boosting is an alternative to penalization for model estimation and prediction in high-dimensional settings. It was originally developed for binary classification in machine learning [15, 16]. The idea of boosting is to iteratively reweight the observations, with larger weights given to observations that are misclassified at the previous iteration, and apply simple classifiers on the reweighted data; their results are then combined to produce an aggregated classification procedure. Boosting was later generalized as a forward stagewise additive modeling method for statistical estimation [17, 18], which can be applied to many problems, including regression analysis for survival data [19]. Because of its flexibility in modeling choices and stability in high-dimensional settings, boosting has found applications in genomics studies; see the references in Mayr et al. [20, 21]. As in the case of LASSO and elastic net, however, existing boosting methods, such as component-wise boosting [22], do not distinguish variables of different data types.
To overcome the limitations of LASSO, elastic net, and existing boosting methods, we develop a novel method, termed Integrative Boosting (IBoost), which combines elastic net with boosting. In IBoost, the prediction rule is constructed iteratively, where at each iteration, the predictive power of each data type (conditional on the current prediction rule) is evaluated separately and the most predictive data type is selected to update the prediction rule using elastic net. Thus, independent signal from each data type can be incorporated into the prediction rule, and small but predictive data types will not be dominated by data types with large numbers of variables. In addition, the penalties on the regression parameters are learned data-adaptively and separately for different data types. Herein, we demonstrate the advantages of IBoost using simulation studies and empirical data from TCGA on patients with eight different cancer types. More importantly, we use IBoost to address the aforementioned three unsolved issues in cancer genomics.
Results and discussion
Background
Suppose that there are K types of clinical or genomics predictors, with d_{k} components for the kth type (k=1,…,K). For k=1,…,K, let X^{(k)} denote the d_{k}-vector of predictors of the kth type. Write X=(X^{(1)′},…,X^{(K)′})^{′}, where A^{′} denotes the transpose of A for any vector or matrix A. Let T denote the survival time of interest. We relate T to X through the proportional hazards model [23], such that the conditional hazard function of T given X takes the form of h_{0}(t) exp(β^{′}X), where h_{0}(t) is an arbitrary baseline hazard function, β=(β^{(1)′},…,β^{(K)′})^{′}, and β^{(k)} is a d_{k}-vector of regression parameters associated with X^{(k)}.
The survival time T is subject to right censoring by C, such that we observe Y≡ min(T,C) and Δ≡I(T≤C), where I(·) is the indicator function. For a study with n patients, the data consist of (Y_{i},Δ_{i},X_{i})(i=1,…,n). The partial likelihood [24] for β is
\[ L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \left\{ \frac{\exp\left(\boldsymbol{\beta}^{\prime}\boldsymbol{X}_{i}\right)}{\sum_{j=1}^{n} I(Y_{j}\ge Y_{i})\exp\left(\boldsymbol{\beta}^{\prime}\boldsymbol{X}_{j}\right)} \right\}^{\Delta_{i}}. \]
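For illustration, the corresponding negative log partial likelihood can be computed as follows. This is a minimal Python sketch (not the implementation used in the paper); it assumes distinct observed times and NumPy arrays, and all names are ours:

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, y, delta):
    """Negative log partial likelihood for the Cox proportional hazards model.

    beta: (d,) coefficient vector; X: (n, d) predictor matrix;
    y: observed times Y_i = min(T_i, C_i); delta: event indicators I(T_i <= C_i).
    Assumes distinct observed times (no tie handling).
    """
    eta = X @ beta                        # linear predictors beta' X_i
    order = np.argsort(-y)                # sort subjects by descending time
    eta_s, delta_s = eta[order], delta[order]
    # Running log-sum-exp gives log sum_{j: Y_j >= Y_i} exp(beta' X_j)
    log_risk = np.logaddexp.accumulate(eta_s)
    return -np.sum(delta_s * (eta_s - log_risk))
```

Sorting by descending time lets the risk-set sums be accumulated in a single pass, and the log-sum-exp recursion avoids numerical overflow for large linear predictors.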
LASSO and elastic net
Because X is high-dimensional, it is not feasible to estimate β by maximizing the partial likelihood. One possible remedy is to impose sparsity assumptions on β and adopt penalization methods, such as LASSO [13] and elastic net [14]. LASSO estimates β by maximizing the L_{1}-penalized log partial likelihood function
\[ \log L(\boldsymbol{\beta}) - \lambda \sum_{j=1}^{d} |\beta_{j}|, \]
where \(L(\boldsymbol{\beta})\) denotes the partial likelihood, \(d=\sum _{k=1}^{K} d_{k}\), and λ is a tuning parameter. Elastic net generalizes LASSO by including an L_{2} penalty, such that the objective function becomes
\[ \log L(\boldsymbol{\beta}) - \lambda \left\{ \alpha \sum_{j=1}^{d} |\beta_{j}| + \frac{1-\alpha}{2} \sum_{j=1}^{d} \beta_{j}^{2} \right\}, \]
where α∈[0,1] is a tuning parameter that controls the relative magnitudes of the L_{1} and L_{2} penalties. (When α=1, elastic net reduces to LASSO.) The implementation of LASSO and elastic net is described in the “Methods” section.
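In code, the elastic net penalty term takes the following form (a small Python sketch with names of our choosing; setting alpha to 1 recovers the LASSO penalty):

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty: lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2).

    alpha = 1 gives the pure L1 (LASSO) penalty; alpha = 0 gives a pure L2
    (ridge) penalty. The penalized objective subtracts this quantity from
    the log partial likelihood.
    """
    beta = np.asarray(beta, dtype=float)
    return lam * (alpha * np.abs(beta).sum() + (1.0 - alpha) / 2.0 * (beta ** 2).sum())
```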
For both LASSO and elastic net, the penalty term dominates under large values of λ, and the parameter estimates tend to be small with some values being exactly zero. Unlike LASSO, elastic net exhibits the grouping effect in that the regression parameters for a group of highly correlated variables tend to be equal, which is desirable in the context of gene selection [14]. Both LASSO and elastic net impose the same penalization on each regression parameter and thus do not distinguish different types of predictors. As a result, these methods may be inefficient when certain data types are much more predictive than others.
IBoost
To account for the differential predictive power of different data types, we propose a boosting algorithm called IBoost. Boosting is an iterative optimization algorithm that minimizes a loss function \(\ell \{\mathcal {Y},\boldsymbol {f}(\mathcal {X})\}\) over a class of functions of predictors \(\boldsymbol {f}(\mathcal {X})\), where \(\mathcal {Y}=(Y_{1},\ldots,Y_{n},\Delta _{1},\ldots,\Delta _{n})\), \(\mathcal {X}=(\boldsymbol {X}_{1},\ldots,\boldsymbol {X}_{n})\), and \(\ell \{\mathcal {Y},\boldsymbol {f}(\mathcal {X})\}\) measures the deviation of the prediction \(\boldsymbol {f}(\mathcal {X})\) from the outcome \(\mathcal {Y}\). At each iteration, we update \(\boldsymbol {f}(\mathcal {X})\) additively by the value \(\boldsymbol {b}(\mathcal {X};\boldsymbol {\beta })\) up to a scaling factor, where b is a fixed basis function, and β is a vector of parameters. Specifically, at the mth iteration, we find β^{(m)} that minimizes \(\ell \{\mathcal {Y},\boldsymbol {f}_{m-1}(\mathcal {X}) + \boldsymbol {b}(\mathcal {X};\boldsymbol {\beta }^{(m)})\}\), possibly under some constraints on β^{(m)}, where f_{m−1} is the estimate of f at the (m−1)th iteration. Then, we set \(\boldsymbol {f}_{m}(\mathcal {X})=\boldsymbol {f}_{m-1}(\mathcal {X}) + v \boldsymbol {b}(\mathcal {X};\boldsymbol {\beta }^{(m)})\) for some fixed step-length factor v∈(0,1]. We terminate the iterations when some stopping criterion is satisfied.
In IBoost, we set the loss function \(\ell \{\mathcal {Y},\boldsymbol {f}(\mathcal {X})\}\) to be the negative log partial likelihood function and the basis function to be \(\boldsymbol {b}(\mathcal {X};\boldsymbol {\beta }^{(m)})=(\boldsymbol {X}_{1}^{(k)\prime }\boldsymbol {\beta }^{(m)},\ldots,\boldsymbol {X}_{n}^{(k)\prime }\boldsymbol {\beta }^{(m)})^{\prime }\), where \(\boldsymbol {X}^{(k)}_{i}\) is the vector of the kth type of predictors for the ith patient, and the data type k is selected data-adaptively. At each iteration, we search over all data types, select the one that yields the largest decrease in the loss function value at the current iteration, and update (a subset of) the regression parameters corresponding to the selected data type; other parameters are fixed at their current estimated values. To handle high-dimensional data, we impose an elastic net penalty on β^{(m)} in the optimization step. Effectively, we perform maximum penalized log partial likelihood estimation with an offset term \(\boldsymbol {f}_{m-1}(\mathcal {X})\) using a single data type at each iteration. Unlike existing boosting methods, such as component-wise boosting, the basis function in our case is a function of all variables of a data type instead of a single variable. This choice of basis function is motivated by the expectations that some data types are much more predictive than others and that the inclusion of less predictive data types may reduce the prediction accuracy of the model. By considering each data type separately, we perform selection at the data-type level at each iteration.
We propose two versions of IBoost, namely IBoostCV and IBoostPermutation, which use cross-validation and permutation, respectively, to choose the tuning parameters of elastic net at each iteration. The permutation procedure randomly permutes the outcome variables in order to remove association between the predictors and the outcome, and the tuning parameters are chosen such that no predictor is selected in half of the permuted data sets. The procedures are described in detail in the “Methods” section.
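The overall flow of the algorithm can be sketched as follows. This is a structural sketch only: the base learner and loss are supplied as callables (in the paper, an elastic-net-penalized fit and the negative log partial likelihood, with data-adaptively chosen tuning parameters), and all names are ours:

```python
import numpy as np

def iboost_skeleton(blocks, fit_fn, loss_fn, y, n_iter=50, v=0.1):
    """Data-type-level boosting loop in the spirit of IBoost (a sketch).

    blocks: list of (n, d_k) matrices, one per data type.
    fit_fn(Xk, y, offset): fits a penalized base learner with the current
        prediction as an offset and returns a coefficient vector.
    loss_fn(y, f): the loss to minimize (the paper uses the negative
        log partial likelihood; any loss fits the same template).
    Returns the per-type coefficients and the final prediction f.
    """
    n = len(y)
    f = np.zeros(n)
    betas = [np.zeros(Xk.shape[1]) for Xk in blocks]
    for _ in range(n_iter):
        best = None
        for k, Xk in enumerate(blocks):
            b = fit_fn(Xk, y, f)                 # base learner for type k
            loss = loss_fn(y, f + Xk @ b)        # loss if type k is chosen
            if best is None or loss < best[0]:
                best = (loss, k, b)
        _, k, b = best
        betas[k] += v * b                        # step-length-damped update
        f += v * (blocks[k] @ b)
    return betas, f
```

At each iteration, only the coefficients of the selected data type are updated, so a small but predictive data type can contribute signal even in the presence of much larger data types.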
Simulation studies
We conducted simulation studies to evaluate the performance of LASSO, elastic net, and the two versions of IBoost. We considered three simulation settings, with different distributions of signals across the data types. In all three settings, a relatively large proportion of the signals is contributed by the clinical variables. The distributions of signals are shown in Fig. 1, and the details of the simulation settings are provided in the “Methods” section.
We assessed the performance of the methods by the quality of prediction and parameter estimation. For prediction, we report the correlation between the estimated risk score \(\sum _{k=1}^{K} \boldsymbol {X}^{(k)\prime } \hat {\boldsymbol {\beta }}^{(k)}\) and the true risk score \(\sum _{k=1}^{K} \boldsymbol {X}^{(k)\prime } \boldsymbol {\beta }^{(k)}_{0}\), where \(\hat {\boldsymbol {\beta }}^{(k)}\) and \(\boldsymbol {\beta }^{(k)}_{0}\) are the estimated and true parameter vectors, respectively. A higher correlation represents a greater degree of agreement between the predicted and actual outcomes. We call this measure the risk correlation. For parameter estimation, we report the mean squared error (MSE), defined as \(\sum _{k=1}^{K} \|\hat {\boldsymbol {\beta }}^{(k)} - \boldsymbol {\beta }^{(k)}_{0}\|_{2}^{2}\).
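The two simulation metrics can be written compactly as follows (a Python sketch; function names are ours):

```python
import numpy as np

def risk_correlation(X_blocks, beta_hat, beta_true):
    """Correlation between estimated and true risk scores across subjects.

    X_blocks: list of (n, d_k) matrices, one per data type;
    beta_hat, beta_true: lists of estimated and true coefficient vectors.
    """
    est = sum(Xk @ bk for Xk, bk in zip(X_blocks, beta_hat))
    true = sum(Xk @ bk for Xk, bk in zip(X_blocks, beta_true))
    return np.corrcoef(est, true)[0, 1]

def mse(beta_hat, beta_true):
    """Sum over data types of squared L2 distances between estimates and truth."""
    return sum(((bh - bt) ** 2).sum() for bh, bt in zip(beta_hat, beta_true))
```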
Figure 1 shows the risk correlation and MSE for elastic net, LASSO, and the two versions of IBoost based on 1000 replications; the average number of variables selected for each data type is also shown. IBoostCV always selects the largest number of variables, followed by elastic net, LASSO, and IBoostPermutation. IBoostCV selects a large number of variables because it iteratively performs elastic net, and the final model includes the selected variables accumulated over all iterations. By contrast, IBoostPermutation, though iterative, performs LASSO (which generally selects fewer variables than elastic net) with the tuning parameter selected by the very conservative permutation method [25], so that it selects the fewest variables.
For estimation, the MSE under IBoostCV or IBoostPermutation is about 20–40% smaller than that under LASSO or elastic net in all settings. Decomposition of the MSE by data types reveals that the MSE for data types with very weak or no signal is small for IBoost. This result shows that even though IBoostCV selects a relatively large number of variables from these data types, the variables generally have very small estimated regression parameters.
For prediction, the two IBoost methods perform the best overall. In all settings, IBoostCV produces more accurate prediction than all other methods. In Settings 1 and 2, where most signals are concentrated on only one or two data types, IBoostPermutation produces more accurate prediction than both elastic net and LASSO. In Setting 3, IBoostPermutation performs similarly to elastic net, while LASSO performs worse than IBoostPermutation. Between the two versions of IBoost, IBoostCV tends to yield better prediction than IBoostPermutation, possibly because of the larger number of variables selected by IBoostCV. Thus, if the main interest is the selection of relevant variables, then one might consider IBoostPermutation for more conservative variable selection, even though this method is somewhat inferior in prediction when compared to IBoostCV.
We implemented LASSO, elastic net, and the two versions of IBoost using R 3.2.2 on a 2.93-GHz Xeon Linux computer. On average, performing LASSO, elastic net, IBoostPermutation, and IBoostCV on one simulated data set (consisting of 500 subjects, 6 data types, and 1294 predictors) takes about 2 min, 14 min, 3 h, and 38 h, respectively. IBoostCV is computationally intensive because in each iteration, cross-validation is conducted on a three-dimensional grid. By contrast, in each IBoostPermutation iteration, the tuning parameter α is fixed at 1, no cross-validation is involved in the selection of λ, and LASSO is performed only once for each data type. Therefore, IBoostPermutation may serve as a computationally efficient alternative to IBoostCV.
Evaluation of LASSO, elastic net, and IBoost using TCGA data
We next evaluated the performance of the methods using three TCGA data sets, namely the lung adenocarcinoma (LUAD) data set, the kidney renal clear cell cancer (KIRC) data set, and a pan-cancer data set derived from ∼ 1400 patients that represents eight different tumor types considered by Hoadley et al. [26]; see the “Methods” section for a detailed description of the data sets and the evaluation procedure. For each data set, we first split the data 30 times into training and testing sets. We then performed LASSO, elastic net, and the two versions of IBoost for various combinations of data types on patients from the training set of each split. For each combination of data types and each split, we calculated the risk scores for patients in the testing set using the estimates from the corresponding training set, and we used the concordance index (C-index) [27] to evaluate the prediction accuracy of the risk scores.
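For reference, Harrell's C-index can be computed as follows (a simple O(n^2) Python sketch that ignores ties in observed times; names are ours):

```python
def c_index(y, delta, risk):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is usable if the subject with the shorter observed time
    had an event; the pair is concordant if that subject also has the
    higher risk score. Ties in risk count as 1/2.
    """
    num = den = 0.0
    n = len(y)
    for i in range(n):
        if not delta[i]:          # censored subjects cannot anchor a pair
            continue
        for j in range(n):
            if y[i] < y[j]:
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den
```

A C-index of 0.5 corresponds to random prediction and 1 to perfect concordance.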
The average C-index values over the splits obtained from LASSO and elastic net are given in Fig. 2. For the KIRC and pan-cancer data sets, the prediction tends to be much better than random (i.e., the C-index values are much larger than 0.5). For the LUAD data set, which has a small sample size, some of the models yield relatively poor prediction (with C-index values smaller than 0.6). For many models, the predictive performance of elastic net is similar or superior to that of LASSO.
For LASSO and elastic net, the models containing more data types as predictors do not necessarily perform better than those with fewer data types. One possible explanation is that the extra data types may contain very little relevant information on patient survival, such that adding those data types introduces more noise than signal into the model. In practice, however, it is challenging to decide which data types to consider without prior knowledge of their importance.
Figure 3 shows the average values of the C-index obtained from elastic net, IBoostCV, and IBoostPermutation for different models. For the LUAD, KIRC, and pan-cancer data sets, both versions of IBoost provide better prediction than elastic net in almost all cases. The difference in prediction accuracy between IBoost and elastic net is particularly large when the sample size is small and the number of predictors is large. The difference is likely due to the fact that IBoost involves the selection of data types, so that the large and nonpredictive data types would not be selected in most iterations, and their presence would not substantially worsen the prediction accuracy. For the KIRC and pan-cancer data sets, IBoostCV yields better prediction than IBoostPermutation, whereas for LUAD, there are no clear differences between the two methods.
Prognostic value of integrated clinical and genomics data
To assess whether the genomic variables provide extra predictive power in the presence of the clinical variables, we computed the net reclassification improvement (NRI) [28, 29] values between the models with both clinical and genomic variables (estimated by IBoostCV or IBoostPermutation) and the model with clinical variables only (estimated by maximum partial likelihood estimation). The NRI compares a model of interest with a baseline model and measures how much a subject’s predicted risk under the model of interest, relative to that under the baseline model, aligns with the subject’s survival time. For instance, an NRI of 0.2 means that by switching from the baseline model to the model of interest, the proportion of high-risk subjects being reassigned a larger predicted risk is on average larger, by a value of 0.2, than the proportion of low-risk subjects being so reassigned; here, high-risk or low-risk subjects refer, respectively, to those with survival times shorter or longer than a fixed threshold, which we set to be 3 years throughout the paper. (See the “Methods” section for a theoretical definition of the NRI.) The average NRI values over data splits are shown in Fig. 4.
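A simplified empirical version of this quantity can be computed as follows. The sketch follows only the verbal description above (it ignores censoring and the down-classification terms that enter the full NRI of [28, 29]); all names are ours:

```python
import numpy as np

def simple_nri(risk_new, risk_old, surv_time, threshold=3.0):
    """Proportion of high-risk subjects up-classified by the new model minus
    the proportion of low-risk subjects up-classified.

    High risk means survival time shorter than `threshold` (3 years in the
    paper). "Up-classified" means the new model assigns a larger predicted
    risk than the baseline model. Censoring is ignored for simplicity.
    """
    up = np.asarray(risk_new) > np.asarray(risk_old)
    high = np.asarray(surv_time) < threshold
    return up[high].mean() - up[~high].mean()
```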
The patterns of the results from IBoostCV and IBoostPermutation are similar. For the KIRC and pan-cancer data sets, the majority of the models that contain both clinical and genomic variables yield positive NRI, which implies that they provide better prediction than the model with clinical variables only. Most NRI values under IBoostCV are close to 0.2; in biomarker studies, an NRI of 0.2 is considered an intermediate-level improvement [30]. For the LUAD data set, only a few models that contain both clinical and genomic variables provide better prediction than the model with clinical variables only. These results indicate that in certain cancer types, genomic variables contribute to survival prediction in the presence of clinical variables, and the magnitude of the contribution can be large. When the same comparisons are made using LASSO or elastic net, however, the inclusion of genomic variables in the models does not appreciably improve prediction.
Evaluation of gene expression modules
To compare the performance of gene modules versus individual gene expression data, we calculated NRI values between models that differ only in the type of gene expression data used. Specifically, for each combination of data types other than individual gene expression data and gene modules, we computed the NRI between the model with those data types plus gene modules (estimated by IBoostCV or IBoostPermutation) and the model with those data types plus individual gene expression data. The NRI values are shown in Fig. 5. Under both methods, the use of gene modules leads to substantially better prediction than the use of expression data of all individual genes for the LUAD data set. For the KIRC and pan-cancer data sets, the performance of the two types of gene expression data is similar, and there is no strong evidence favoring gene modules or individual gene expression data on the basis of prediction accuracy. Nevertheless, because gene modules are smaller in number and much easier to interpret, we generally recommend the use of gene modules over individual gene expression data.
Comparison among genomics data types
To evaluate the relative prognostic value of each genomics data type, we formed a series of nested models as follows. We began by setting the model with clinical variables only as the first member of the series. At each later step, we computed the NRI of each candidate model, formed by adding one extra genomics data type to the data types already included, relative to the model selected at the previous step. The candidate model that yielded the largest NRI was set to be the next member of the series. The process was repeated until all data types were included. (Individual gene expression data were not considered in this analysis.) Because the data type that yields the largest improvement in predictive power over those already included is selected at each step, more predictive data types tend to be included earlier, and the order in which the data types enter the models reflects their relative importance. We performed this procedure for elastic net and the two versions of IBoost. For the LUAD, KIRC, and pan-cancer data sets, the NRI values and their 95% confidence intervals for the series of models are plotted in Fig. 6, and the data type selected at each step is shown. We also plotted the C-index against the number of variables selected for each model.
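The greedy construction of the nested model series can be sketched as follows (a structural Python sketch; `nri_fn` stands in for the full fit-and-evaluate pipeline and is a hypothetical callable of our own devising):

```python
def forward_select_types(types, nri_fn):
    """Greedily order data types by incremental predictive value.

    types: list of candidate data-type names.
    nri_fn(current, candidate): returns the NRI of the model using
        current + [candidate] relative to the model using `current` only.
    Returns the data types in the order in which they enter the model.
    """
    current, order = [], []
    remaining = list(types)
    while remaining:
        best = max(remaining, key=lambda t: nri_fn(current, t))
        order.append(best)
        current = current + [best]
        remaining.remove(best)
    return order
```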
Because different methods vary in their abilities to extract useful information from given data types, the orders of data types determined by the methods are generally different. For the LUAD, KIRC, and pan-cancer data sets, the NRI under IBoostCV or IBoostPermutation tends to be positive or around zero with the inclusion of each new data type. This indicates that IBoost extracts useful information from each additional data type and that its performance tends not to be worsened by the inclusion of additional variables.
IBoostPermutation always selects the smallest number of variables, followed by elastic net and IBoostCV. This finding is consistent with the conclusions from the simulation studies. Because the C-index obtained by IBoostPermutation is higher in most cases than that obtained by elastic net, we conclude that IBoostPermutation provides the same or better prediction using fewer variables than elastic net.
For the LUAD and pan-cancer data sets, gene modules are the first genomics data type selected under both versions of IBoost, and the inclusion of gene modules leads to considerable improvement in prediction accuracy. For the KIRC data set, miRNA expression data are first selected by IBoostCV, while gene modules are first selected by IBoostPermutation. For IBoostCV, however, the model with clinical variables and gene modules yields an NRI of 0.19, which represents a substantial improvement over the model with clinical variables only. The confidence intervals of the NRI include zero due to the small sample sizes of the testing data sets. Nevertheless, the pattern of consistent positive NRI values shown in Fig. 4 and the fact that the NRI values are averages over 30 data splits suggest that the improvement in prediction accuracy is robust. For both versions of IBoost, after the inclusion of the first genomics data type, the improvement in prediction accuracy with the inclusion of additional data types is marginal. We conclude that gene modules are overall the most predictive genomics data type, and the remaining genomics data types tend not to provide extra predictive power beyond clinical variables and gene modules.
We also evaluated the prognostic value of genomics data in the absence of clinical data. The average C-index values for combinations of genomics data types over 30 training and testing data splits for the LUAD, KIRC, and pan-cancer data sets are given in Additional file 1: Fig. S1. The maximum C-index values obtained using genomics data types alone are 0.64, 0.72, and 0.74 in the LUAD, KIRC, and pan-cancer data sets, respectively; they are substantially smaller than the corresponding maximum values obtained using both clinical and genomics data. For the LUAD data set, miRNA expression data alone yield the largest C-index, whereas for the KIRC data set, the combination of miRNA expression and protein expression data yields the largest C-index. For the pan-cancer data set, the C-index values for combinations of genomics data types with individual gene expression data are almost identical and are larger than those obtained without individual gene expression data.
Important predictors for the LUAD, KIRC, and pan-cancer data sets
To obtain the final models of important predictors, we performed IBoostPermutation on the LUAD, KIRC, and pan-cancer data sets. The final models are shown in Tables 1, 2, and 3 for the LUAD, KIRC, and pan-cancer data sets, respectively. The predictors that are also selected by LASSO, elastic net, and IBoostCV are marked.
Age and pathological nodal status are negatively associated with survival time in the LUAD, KIRC, and pan-cancer data sets. Age has been reported to be prognostic for many cancer types [31–33]. In the analysis of the pan-cancer data set, cancer types were selected, which is logical, since the survival time is known to depend on cancer types [26]. Thus, the tissue of origin remains an important prognostic factor. Among the gene modules, Glycolysis_signature and MUnknown_24 are negatively associated with survival time in the LUAD and pan-cancer data sets; these two modules are correlated with Hypoxia signatures among a set of 1198 TCGA breast cancer patients. Likewise, Pcorr_IGS_Correlation and Activate.Endothelium, which are negatively associated with survival time in the pan-cancer data set, are correlated with proliferation signatures; the latter are known to be negatively associated with survival time.
In contrast, signatures of CD8 T cells, noninflammatory breast cancer (nIBC and MM_Red2), and luminal features (Mature_LuminalUp, GP7_estrogen signaling, HS_Green1, HS_Green8, LUMINAL_Cluster, Duke_Module06_er, Pcorr_Dasatinib_L_Correlation, and HS_Green18) are positively associated with survival time in the KIRC or pan-cancer data sets. The NEU_cluster module is positively associated with survival time in the LUAD data set, which is biologically significant because this module represents epithelial luminal cell differentiation and thus tracks more differentiated and lower grade lung cancers. The selected features, many of which are also selected by other variable selection methods, have significant biological implications and demonstrate the robustness of the IBoost methodology.
Conclusions
In this paper, we present a novel method, termed IBoost, for variable selection and outcome prediction that is especially powerful when one wishes to simultaneously consider multiple genomics and/or proteomics data types. We used simulation studies and real data to demonstrate that in the presence of multiple data types with diverse signal strength, IBoost produces better outcome prediction than LASSO and elastic net. We proposed two versions of IBoost, namely IBoostCV and IBoostPermutation. IBoostCV yields more accurate prediction than IBoostPermutation, but it generally selects many more variables and is computationally more intensive. By contrast, IBoostPermutation is computationally efficient and selects far fewer variables, which may be preferable for follow-up experiments.
Consistent with the current literature, we found that clinical variables are strong predictors of survival time. With IBoost, we were able to build upon the clinical variables and extract additional useful information from genomic variables in order to improve the prediction; the improvement that we obtained with IBoost was considerably larger than that obtained by either LASSO or elastic net. We also compared the use of individual gene expression data versus gene modules and found that the use of gene modules leads to improvement in prediction accuracy and more interpretable results. When we considered the selected IBoost models, clinical variables (e.g., age, tumor size, and pathological nodal status) were strong predictors of survival. The IBoost methods also selected several gene modules that were previously identified as prognostic of outcomes, whether positive or negative.
Our study has limitations. The main limitation is that the LUAD and KIRC data sets pertain to a relatively small number of patients, with an even smaller number of observed events. This limitation motivated us to combine eight solid epithelial tumor types to form a large pan-cancer data set. The analyses on the pan-cancer data might not properly account for heterogeneity across different cancer types. Another limitation of our study is that the quality of the clinical data varies across different cancer types; for example, the follow-up time for some cancer types was quite short.
In summary, we demonstrated that IBoost outperforms elastic net and LASSO and that gene modules outperform the totality of individual genes. The IBoost methodology is applicable to any disease setting in which multiple types of genomics and/or proteomics data are available and thus has potential applications beyond cancer studies.
Methods
Data description
TCGA provides a large open-access database that includes clinical and genomics data for patients with 33 cancer types or subtypes. Herein, we focused on eight cancer types or subtypes, namely LUAD, KIRC, colon adenocarcinoma (COAD), rectal adenocarcinoma (READ), lung squamous cell carcinoma (LUSC), bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), and head and neck squamous cell carcinoma (HNSC). For clinical variables, somatic mutation, copy number variation, mRNA expression, and miRNA expression, data on 2272 patients were obtained from the December 22, 2012, Pan-Cancer-12 data freeze from the Sage Bionetworks Repository Synapse [34]; the data were previously processed and described by Hoadley et al. [26]. Protein expression data were downloaded from Broad GDAC Firehose [35] for a subset of 1779 patients included in the data set of Hoadley et al. [26].
Clinical variables included gender, age, pathological stages T and N, and cancer type. In all analyses, COAD and READ were considered as one cancer type. For mRNA expression data, we used RNA-Seq by Expectation-Maximization (RSEM) [36] to quantify the transcript abundances measured by RNA sequencing and used the log2-transformed, upper-quartile-normalized RSEM values of 12,434 genes. The RNA sequencing was performed at the University of North Carolina at Chapel Hill [37–39]. Gene-level expression data are also available on Broad GDAC Firehose [35]. For mutation data, we used the single nucleotide variant calls, which were de-duplicated and re-annotated using the Ensembl version 69 transcript database. A total of 130 genes with non-synonymous mutations in more than 10% of the whole sample were included in the analyses. The combined mutation annotation format file is available from the Synapse resource. For miRNA expression data, we used the normalized read counts of 305 miRNAs, which were compiled into an abundance matrix of 5p and 3p mature miRBase strands [37]. For reverse-phase protein arrays, we used the level 3 normalized data for 136 proteins or phosphoproteins. For copy number data, SNP6.0 array-based gene-level somatic copy number alteration data were generated from the GISTIC analysis [40]. The input data matrix is available in Synapse at syn1710678. We used the copy number values for 216 cancer-specific segments, which are frequently altered in cancers of various types including breast cancer, and segments for all chromosome arms (a total of 41 segments) [41, 42].
We defined gene modules as sets of co-expressed genes that are considered to be functional units in breast cancer. We built a collection of 497 gene modules, constructed on the basis of 73 publications or results from the Gene Set Enrichment Analysis [43]. A partial list of the modules appears in Fan et al. [12]. Among the modules, 461 are median expression values of homogeneously expressed genes, 33 are correlations of expression values with predetermined gene centroids, and 3 are built from previously published gene expression prognostic models.
After removing patients with missing data, the total sample size was 1420, including 202 LUAD patients and 195 KIRC patients. All survival times were censored at 5 years if the patients were still in the study at that time point. For the pancancer data set, the median followup time was 16.8 months, and the censoring rate was 77.6%. For the subset of LUAD patients, the median followup time was 13.9 months, and the censoring rate was 71.3%. For the subset of KIRC patients, the median followup time was 28.9 months, and the censoring rate was 63.6%.
LASSO and elastic net
We implemented LASSO and elastic net using the R package “glmnet” [44] and used fivefold cross-validation to select the tuning parameters. For elastic net, cross-validation was performed over a two-dimensional grid of (α,λ), whereas for LASSO, α was set to 1. For elastic net, the grid for α was (0.05,0.1,0.2,…,1.0), and a grid for λ was chosen separately for each α using the default settings of glmnet. (A minimum value of 0.05 was imposed on α, because values of α too close to 0 may result in too many variables being selected; in particular, no sparsity is imposed when α=0.) To make the selection procedure more stable, we repeated the split-and-evaluation procedure five times and averaged the cross-validation errors over the five repetitions.
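The repeated cross-validation scheme above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: it uses scikit-learn's `ElasticNet` with a Gaussian response in place of glmnet's penalized Cox model, and the function name `select_alpha_lambda` and its default grids are hypothetical. Note that in scikit-learn's parameterization, `l1_ratio` plays the role of α and `alpha` plays the role of λ.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold

def select_alpha_lambda(X, y, alphas=(0.05, 0.1, 0.2, 0.5, 1.0),
                        lambdas=np.logspace(-2, 0, 5),
                        n_folds=5, n_repeats=5, seed=0):
    """Pick (alpha, lambda) by averaging n_folds-fold CV error over
    n_repeats random splits, mirroring the stabilized selection above."""
    rng = np.random.RandomState(seed)
    best, best_err = None, np.inf
    for a in alphas:
        for lam in lambdas:
            errs = []
            for _ in range(n_repeats):
                kf = KFold(n_splits=n_folds, shuffle=True,
                           random_state=rng.randint(1 << 30))
                for tr, te in kf.split(X):
                    # scikit-learn naming: l1_ratio = alpha, alpha = lambda
                    m = ElasticNet(alpha=lam, l1_ratio=a, max_iter=10000)
                    m.fit(X[tr], y[tr])
                    errs.append(np.mean((y[te] - m.predict(X[te])) ** 2))
            err = np.mean(errs)
            if err < best_err:
                best, best_err = (a, lam), err
    return best
```

Averaging the cross-validation error over repeated random splits, as in the procedure above, reduces the variance of the selected tuning parameters at the cost of extra fits.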
IBoost
The IBoost algorithm is given as follows:

1. Set f_{0,i}=0 for i=1,…,n, and let f_{0}=(f_{0,1},…,f_{0,n})^{′}.

2. Consider m=1,2,…:

(a) For a given k_{m}∈{1,…,K}, calculate
$$\boldsymbol{\beta}^{(m)}=\operatorname{argmax}_{\boldsymbol{\beta}}\left\{\log L^{(k_{m})}(\boldsymbol{f}_{m-1};\boldsymbol{\beta})-p^{(k_{m})}(\boldsymbol{\beta};\alpha_{m},\lambda_{m})\right\}$$
using the coordinate-descent algorithm [44], where
$$L^{(k)}(\boldsymbol{f};\boldsymbol{\beta})\equiv\prod_{i=1}^{n}\left(\frac{e^{f_{i}+\boldsymbol{X}^{(k)\prime}_{i}\boldsymbol{\beta}}}{\sum_{j:Y_{j}\ge Y_{i}}e^{f_{j}+\boldsymbol{X}^{(k)\prime}_{j}\boldsymbol{\beta}}}\right)^{\Delta_{i}}$$
is the partial likelihood with offset term f and covariates X^{(k)}, α_{m} and λ_{m} are tuning parameters, f=(f_{1},…,f_{n})^{′}, and
$$p^{(k)}(\boldsymbol{\beta};\alpha,\lambda)\equiv\lambda\left\{\alpha\sum_{j=1}^{d_{k}}|\beta_{j}|+\frac{1}{2}(1-\alpha)\sum_{j=1}^{d_{k}}\beta_{j}^{2}\right\}$$
is the elastic net penalty. The selection of k_{m}, α_{m}, and λ_{m} is described below.

(b) Set \(f_{m,i}=f_{m-1,i}+v\boldsymbol{X}^{(k_{m})\prime}_{i}\boldsymbol{\beta}^{(m)}\) for i=1,…,n with v=0.1, and let f_{m}=(f_{m,1},…,f_{m,n})^{′}.

At the mth iteration, only the regression parameters corresponding to the k_{m}th data type are updated. We refer to the d-vector with value β^{(m)} at the positions corresponding to the k_{m}th data type and zero elsewhere as the current estimate at the mth iteration. The current estimate at each iteration contributes to the final parameter estimate additively: the final parameter estimate is simply the sum of the current estimates over all iterations, multiplied by v.
IBoost-CV and IBoost-Permutation use cross-validation and permutation, respectively, to choose (k_{m},α_{m},λ_{m}) in step 2(a). For IBoost-CV, we adopt fivefold cross-validation separately at each iteration over a three-dimensional grid on \(\{1,\ldots,K\}\times [0.05,1]\times (0,\lambda ^{(\max)}_{m})\) for (k_{m},α_{m},λ_{m}), where \(\lambda ^{(\max)}_{m}\) is a value large enough to shrink the current estimate to zero.
For IBoost-Permutation, we first perform LASSO separately for each data type X^{(k)} (k=1,…,K) with tuning parameter \(\lambda ^{(k)}_{m}\), where \(\lambda ^{(k)}_{m}\) is selected using the permutation method of Sabourin et al. [25]; the permutation method is applicable only to LASSO. The procedure is motivated by the principle that under a null model, i.e., in the absence of any relevant predictors, the tuning parameter should be chosen such that no variable is selected. The procedure first generates hypothetical null models by randomly permuting (Y_{i},Δ_{i},f_{m−1,i}) B times at each iteration, so that in each permuted data set the association between the predictors and the outcome (and the offset term) is removed. It then finds, for each permuted data set, the smallest λ such that no variable is selected and takes the median of the B values of λ. For the kth data type (k=1,…,K), let \(\lambda ^{(k)}_{m}\) be the selected tuning parameter and \(\boldsymbol {\beta }^{(k)}_{m}\) be the corresponding LASSO estimate. We select k_{m} based on the partial-likelihood value at \(\boldsymbol {\beta }^{(k)}_{m}\), i.e., \(k_{m} = \operatorname{argmax}_{k} L^{(k)}(\boldsymbol{f}_{m-1};\boldsymbol {\beta }^{(k)}_{m})\), and set α_{m}=1 and \(\lambda_{m}=\lambda^{(k_{m})}_{m}\).
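The permutation rule for λ can be sketched in a plain linear-model setting, a hedged stand-in for the Cox partial likelihood with offset that the actual method handles through glmnet. For a lasso with standardized predictors, no intercept, and the glmnet/scikit-learn loss scaling (1/(2n))‖y−Xβ‖², the smallest λ that selects no variable has the closed form max_j |x_j′y|/n, so no model fitting is needed; the function name `permutation_lambda` is hypothetical.

```python
import numpy as np

def permutation_lambda(X, y, B=50, seed=0):
    """Permutation choice of the lasso penalty in the spirit of Sabourin
    et al.: for each permuted response, compute the smallest lambda at
    which the lasso selects nothing (max_j |x_j' y_perm| / n under the
    glmnet/scikit-learn scaling) and return the median over B permutations."""
    rng = np.random.default_rng(seed)
    n = len(y)
    lams = []
    for _ in range(B):
        yp = rng.permutation(y)                   # break the X-y association
        lams.append(np.max(np.abs(X.T @ yp)) / n) # smallest all-zero lambda
    return float(np.median(lams))
```

Because the permuted data sets mimic the null model, the median of these thresholds is a penalty just large enough that pure noise selects no variables.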
Empirical studies suggest that a small value of the step-length factor v often improves and almost never worsens the performance of boosting [45]. It is therefore recommended that v be chosen as small as possible while keeping the algorithm computationally feasible. In the settings we considered, the performance of IBoost is not sensitive to v within the range v∈(0.05,0.5). We therefore set v to the moderately small value of 0.1.
Conventional boosting methods require a stopping criterion to avoid overfitting. For IBoost, however, because the tuning parameters are selected separately at each iteration, the penalty eventually shrinks all current parameter estimates to zero, so we do not adopt a separate procedure to determine the stopping time. We terminate the algorithm when f_{m} remains constant for five consecutive iterations.
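The full loop can be sketched as follows. This is a schematic with assumed names (`iboost_sketch`, `X_blocks`): to keep it self-contained, the penalized Cox fit of step 2(a) is replaced by a lasso least-squares fit to the current residual (an L2-boosting stand-in), and k_m is chosen by residual error rather than the partial likelihood. The offset update, the additive accumulation of coefficients scaled by v, and the "constant for five consecutive iterations" stopping rule follow the algorithm above.

```python
import numpy as np
from sklearn.linear_model import Lasso

def iboost_sketch(X_blocks, y, v=0.1, lam=0.05, max_iter=200, patience=5):
    """Schematic IBoost loop with lasso least-squares base learners
    standing in for the penalized Cox fit of step 2(a)."""
    n = len(y)
    f = np.zeros(n)                                   # offset f_0 = 0
    coefs = [np.zeros(X.shape[1]) for X in X_blocks]  # accumulated estimates
    stable = 0
    for _ in range(max_iter):
        # Step 2(a): fit each data type to the current residual; keep the
        # best-fitting one (stand-in for the partial-likelihood criterion).
        fits = []
        for X in X_blocks:
            model = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
            model.fit(X, y - f)
            err = np.mean((y - f - X @ model.coef_) ** 2)
            fits.append((err, model.coef_))
        k = int(np.argmin([e for e, _ in fits]))
        beta = fits[k][1]
        # Step 2(b): damped additive update; coefficients accumulate
        # additively across iterations, scaled by v.
        coefs[k] = coefs[k] + v * beta
        f_new = f + v * (X_blocks[k] @ beta)
        stable = stable + 1 if np.allclose(f_new, f) else 0
        f = f_new
        if stable >= patience:  # f_m unchanged for several iterations
            break
    return coefs, f
```

Only one data type is updated per iteration, so data types with stronger residual signal are visited more often, which is how the method adapts to heterogeneous signal strengths across platforms.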
Simulation studies
In the simulation studies, we considered all data types except individual gene expression data. For each simulated data set, we generated the predictors by sampling without replacement whole vectors of predictors from the TCGA pan-cancer data set. We generated the survival time from a proportional hazards model with the baseline hazard function h_{0}(t)=t and generated the censoring time from an exponential distribution with mean chosen to yield a censoring proportion of about 50%. We set the sample size n to 500 in all settings.
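The survival-time generation step can be sketched via inverse-transform sampling: with baseline hazard h0(t)=t, the cumulative baseline hazard is H0(t)=t²/2, so T = √(−2 log U · e^{−X′β}) for U ~ Uniform(0,1). The function name is hypothetical, and the censoring mean is passed as a fixed argument here, whereas in the study it was tuned to give roughly 50% censoring.

```python
import numpy as np

def simulate_survival(X, beta, cens_mean, seed=0):
    """Draw survival times from a proportional hazards model with baseline
    hazard h0(t) = t (so H0(t) = t^2/2) by inverting the survival function,
    then apply exponential censoring."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    eta = X @ beta                                   # linear predictor
    u = rng.uniform(size=n)
    t = np.sqrt(-2.0 * np.log(u) * np.exp(-eta))     # H0^{-1}(-log U / e^eta)
    c = rng.exponential(cens_mean, size=n)
    y = np.minimum(t, c)                             # observed time
    delta = (t <= c).astype(int)                     # event indicator
    return y, delta
```

The inversion works because S(t|X) = exp(−H0(t)e^{X′β}) is Uniform(0,1) when evaluated at the true T, so setting S(T|X)=U and solving for T reproduces the target distribution.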
The regression parameters were chosen to produce different proportions of signal across data types, where the signal of data type k is defined as \(\text {Var}(\boldsymbol {X}^{(k)\prime } \boldsymbol {\beta }^{(k)}_{0})\), with the predictors standardized. The variables with nonzero regression parameters, hereafter referred to as signal variables, were chosen to be weakly correlated. We considered three settings, with the distributions of signal and the numbers of signal variables shown at the bottom of Fig. 1. In all settings, the signals of all data types sum to 1.2, and the signal variables within the same data type have equal regression parameters; based on simulation studies not presented here, the relative performance of the methods is very similar under different values of the total signal. In Setting 1, the clinical variables contain a much stronger signal than the other data types, and the mutation and copy number variation data contain no signal. In Setting 2, the signal is shared equally between the clinical variables and the gene modules. In Setting 3, the clinical variables contain the largest signal, and the remaining signal is distributed evenly across the other data types.
Because we considered a total of six data types, IBoost-CV is computationally demanding. To lessen the computational burden, we set v=0.2 instead of the value of 0.1 used in the real data analysis.
Assessment of prediction
To assess an analysis method, we split the data into 30 pairs of training and testing sets with a 3:2 ratio of sample sizes. We used the R package “sampling” [46] to perform the splits such that the distributions of the clinical variables in the training and testing sets were approximately equal. We performed each analysis on the training set and assessed the results on the corresponding testing set using the C-index. For each split of the data, we repeated this estimation-validation procedure with different combinations of data types as predictors. We considered only combinations of data types that include clinical variables, because clinical variables are almost always considered in practice and one of the main objectives of this paper is to evaluate the prognostic value of combining genomics and clinical data. The analyses were conducted on the 30 splits of the data and the 48 combinations of data types for the LUAD, KIRC, and pan-cancer data sets.
To quantify the prediction accuracy, we used the C-index. Let T_{i} be the survival time and X_{i} be the vector of predictors for the ith subject, and let β be a vector of regression parameters. The risk score is defined as \(\boldsymbol {X}_{i}^{\prime } \boldsymbol {\beta }\). If T_{i} and \(\boldsymbol {X}_{i}^{\prime } \boldsymbol {\beta }\) are continuous, then the C-index is defined as \(P(\boldsymbol {X}_{i}^{\prime } \boldsymbol {\beta } > \boldsymbol {X}_{j}^{\prime } \boldsymbol {\beta }\mid T_{i}<T_{j})\). The C-index is the probability that for a random pair of subjects in which the first subject has the shorter survival time, the first subject has the higher risk score. Thus, the C-index measures how well the risk score aligns with the actual survival time. For each pair of training and testing sets, we set β to the parameter estimate obtained from the training set and estimated the C-index on the testing set using the method of Pencina and D’Agostino [27]. If no variable was selected, then a C-index value of 0.5 was assigned.
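With censored data, the C-index can be estimated by restricting attention to usable pairs, i.e., pairs in which the earlier of the two times is an observed event. The following is a minimal Harrell-type sketch with hypothetical names, not the exact Pencina–D'Agostino estimator used in the paper; ties in the risk score count one half.

```python
def c_index(time, event, score):
    """Harrell-type C-index: among usable pairs (the earlier time is an
    observed event), the fraction in which the shorter-lived subject has
    the higher risk score; score ties count 1/2."""
    conc = usable = 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:          # censored subjects cannot anchor a pair
            continue
        for j in range(n):
            if time[j] > time[i]:  # subject j outlived subject i
                usable += 1
                if score[i] > score[j]:
                    conc += 1.0
                elif score[i] == score[j]:
                    conc += 0.5
    return conc / usable if usable else 0.5
```

A value of 1 indicates perfect concordance between risk scores and survival times, 0.5 indicates no discrimination, and 0 indicates perfectly reversed ordering.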
We used the NRI to compare the prediction accuracy of a model of interest with that of a baseline model. Let T be the survival time, X and \(\tilde {\boldsymbol {X}}\) be the vectors of predictors for the model of interest and the baseline model, respectively, and β and \(\tilde {\boldsymbol {\beta }}\) be the corresponding vectors of regression parameters. Let q and \(\tilde {q}\) be the (estimated) survival probabilities at a fixed time point t_{0} given the univariate covariates X^{′}β and \(\tilde {\boldsymbol {X}}^{\prime }\tilde {\boldsymbol {\beta }}\), respectively, where t_{0} is a survival-time threshold such that subjects with T<t_{0} are considered high risk. The NRI is defined as \(P\left (q < \tilde {q}\mid T < t_{0}\right) - P\left (q < \tilde {q}\mid T >t_{0}\right)\). A large NRI means that switching from the baseline model to the model of interest changes the predicted risk in the direction of the actual survival time for a large proportion of subjects. To compute the NRI between two models for a pair of training and testing sets, we set \((\boldsymbol {\beta },\tilde {\boldsymbol {\beta }})\) to the parameter estimates obtained from the training set and calculated \((q,\tilde {q})\) on the testing set. We estimated the NRI and its confidence interval on the testing set using the method of Uno et al. [29]. The reported NRI values and confidence limits are averages over the 30 training and testing splits. Note that this NRI is one half the value of the NRI(>0) defined in Pencina et al. [28, 30].
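As a simplified illustration of this definition, the NRI can be computed by comparing predicted survival probabilities within the two risk groups. The sketch below ignores censoring, which the Uno et al. estimator used in the paper handles properly, and its names are hypothetical.

```python
import numpy as np

def nri(time, q_new, q_base, t0):
    """NRI as defined in the text, ignoring censoring: among subjects with
    T < t0 (high risk), the fraction whose predicted survival probability
    drops under the new model, minus the same fraction among subjects
    with T > t0."""
    time, q_new, q_base = map(np.asarray, (time, q_new, q_base))
    hi = time < t0                 # high-risk group: event before t0
    lo = time > t0                 # low-risk group: survived past t0
    p_hi = np.mean(q_new[hi] < q_base[hi]) if hi.any() else 0.0
    p_lo = np.mean(q_new[lo] < q_base[lo]) if lo.any() else 0.0
    return p_hi - p_lo
```

A positive value indicates that the new model moves predicted risks in the right direction more often for high-risk subjects than it moves them in the wrong direction for low-risk subjects.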
References
 1
Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, et al. Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med. 2008; 14:822–7.
 2
West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA. 2001; 98:11462–7.
 3
Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med. 2002; 8:816–24.
 4
Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002; 8:68–74.
 5
van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002; 415:530–6.
 6
Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009; 25:2906–12.
 7
Mo Q, Wang S, Seshan VE, Olshen AB, Schultz N, Sander C, et al. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc Natl Acad Sci USA. 2013; 110:4245–50.
 8
Lock EF, Hoadley KA, Marron JS, Nobel AB. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat. 2013; 7:523–42.
 9
Wang W, Baladandayuthapani V, Morris JS, Broom BM, Manyam G, Do KA. iBAG: integrative Bayesian analysis of highdimensional multiplatform genomics data. Bioinformatics. 2013; 29:149–59.
 10
Yuan Y, Van Allen EM, Omberg L, Wagle N, AminMansour A, Sokolov A, et al. Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nat Biotechnol. 2014; 32:644–52.
 11
Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009; 27:1160–7.
 12
Fan C, Prat A, Parker JS, Liu Y, Carey LA, Troester MA, et al. Building prognostic models for breast cancer patients using clinical variables and hundreds of gene expression signatures. BMC Med Genomics. 2011; 4:3.
 13
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996; 58:267–88.
 14
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol. 2005; 67:301–20.
 15
Schapire RE. The strength of weak learnability. Mach Learn. 1990; 5:197–227.
 16
Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997; 55:119–39.
 17
Breiman L. Arcing classifier (with discussion). Ann Stat. 1998; 26:801–49.
 18
Friedman J, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting (with discussion). Ann Stat. 2000; 28:337–407.
 19
Hothorn T, Bühlmann P, Dudoit S, Molinaro A, Van Der Laan MJ. Survival ensembles. Biostatistics. 2005; 7:355–73.
 20
Mayr A, Binder H, Gefeller O, Schmid M. The evolution of boosting algorithms. Methods Inf Med. 2014; 53:419–27.
 21
Mayr A, Binder H, Gefeller O, Schmid M. Extending statistical boosting. Methods Inf Med. 2014; 53:428–35.
 22
Bühlmann P, Yu B. Boosting with the L2 loss: regression and classification. J Am Stat Assoc. 2003; 98:324–39.
 23
Cox DR. Regression models and life-tables. J R Stat Soc Series B Stat Methodol. 1972; 34:187–220.
 24
Cox DR. Partial likelihood. Biometrika. 1975; 62:269–76.
 25
Sabourin JA, Valdar W, Nobel AB. A permutation approach for selecting the penalty parameter in penalized model selection. Biometrics. 2015; 71:1185–94.
 26
Hoadley KA, Yau C, Wolf DM, Cherniack AD, Tamborero D, Ng S, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 2014; 158:929–44.
 27
Pencina MJ, D’Agostino RB. Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Stat Med. 2004; 23:2109–23.
 28
Pencina MJ, D’Agostino Sr RB, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011; 30:11–21.
 29
Uno H, Tian L, Cai T, Kohane IS, Wei L. A unified inference procedure for a class of measures to assess improvement in risk prediction systems with survival data. Stat Med. 2013; 32:2430–42.
 30
Pencina MJ, D’Agostino RB, Pencina KM, Janssens ACJ, Greenland P. Interpreting incremental value of markers added to risk prediction models. Am J Epidemiol. 2012; 176:473–81.
 31
Lieu CH, Renfro LA, De Gramont A, Meyers JP, Maughan TS, Seymour MT, et al. Association of age with survival in patients with metastatic colorectal cancer: analysis from the ARCAD Clinical Trials Program. J Clin Oncol. 2014; 32:2975–82.
 32
de la Rochefordière A, Campana F, Fenton J, Vilcoq J, Fourquet A, Asselain B, et al. Age as prognostic factor in premenopausal breast carcinoma. Lancet. 1993; 341:1039–43.
 33
Asmis TR, Ding K, Seymour L, Shepherd FA, Leighl NB, Winton TL, et al. Age and comorbidity as independent prognostic factors in the treatment of non-small-cell lung cancer: a review of National Cancer Institute of Canada Clinical Trials Group trials. J Clin Oncol. 2008; 26:54–9.
 34
Sage Bionetworks Repository Synapse. Multiplatform analysis of 12 cancer types to identify integrative subtypes; https://www.synapse.org/#!Synapse:syn2468297/. Accessed 12 Oct 2015.
 35
Broad Institute TCGA Genome Data Analysis Center. Analysis-ready standardized TCGA data from Broad GDAC Firehose 2016_01_28 run. Broad Institute of MIT and Harvard; https://doi.org/10.7908/C11G0KM9. Accessed 26 Jun 2017.
 36
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011; 12:323.
 37
The Cancer Genome Atlas Research Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012; 490:61–70.
 38
The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012; 487:330–7.
 39
The Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature. 2012; 489:519–25.
 40
Zack TI, Schumacher SE, Carter SL, Cherniack AD, Saksena G, Tabak B, et al. Pancancer patterns of somatic copy number alteration. Nat Genet. 2013; 45:1134–40.
 41
Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, et al. The landscape of somatic copy-number alteration across human cancers. Nature. 2010; 463:899–905.
 42
Chao HH, He X, Parker JS, Zhao W, Perou CM. Micro-scale genomic DNA copy number aberrations as another means of mutagenesis in breast cancer. PLoS ONE. 2012; 7:e51719.
 43
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005; 102:15545–50.
 44
Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011; 39:1–13.
 45
Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001; 29:1189–232.
 46
Tillé Y, Matei A. Sampling: survey sampling. R package version 2.8. 2016.
 47
Wong KY, Fan C, Maki T, Parker JS, Nobel AB, Zeng D, et al. IBoost: an integrative boosting approach for predicting survival time with multiple genomics platforms. Processed data; 2019. https://doi.org/10.5281/zenodo.2530387. Accessed 4 Jan 2019.
 48
Wong KY, Fan C, Maki T, Parker JS, Nobel AB, Zeng D, et al. IBoost: an integrative boosting approach for predicting survival time with multiple genomics platforms. Source code Github repository; 2019. https://github.com/alexwky/IBoost. Accessed 4 Jan 2019.
 49
Wong KY, Fan C, Maki T, Parker JS, Nobel AB, Zeng D, et al. IBoost: an integrative boosting approach for predicting survival time with multiple genomics platforms. Source code; 2019. https://doi.org/10.5281/zenodo.2529986. Accessed 4 Jan 2019.
 50
Wong KY, Fan C, Maki T, Parker JS, Nobel AB, Zeng D, et al. IBoost: an integrative boosting approach for predicting survival time with multiple genomics platforms. Code Github repository; 2019. https://github.com/alexwky/IBoostPaper2019. Accessed 7 Jan 2019.
 51
Wong KY, Fan C, Maki T, Parker JS, Nobel AB, Zeng D, et al. IBoost: an integrative boosting approach for predicting survival time with multiple genomics platforms. Code; 2019. https://doi.org/10.5281/zenodo.2532847. Accessed 7 Jan 2019.
Acknowledgements
Not applicable.
Funding
This work was supported with funds from the NCI Breast SPORE program (P50CA5822309A1), the Breast Cancer Research Foundation, Susan G. Komen, the V Foundation for Cancer Research, and National Institutes of Health grants R01CA148761, R01GM047845, R01HG009974, and P01CA142538.
Availability of data and materials
For clinical variables, somatic mutation, copy number variation, mRNA expression, and miRNA expression, data were obtained from Synapse [34] (https://www.synapse.org/#!Synapse:syn2468297). Protein expression data were downloaded from Broad GDAC Firehose [35] (https://doi.org/10.7908/C11G0KM9). The processed data set used in this paper is deposited on Zenodo [47] (https://doi.org/10.5281/zenodo.2530387).
The IBoost software is distributed under the MIT license and can be downloaded from Github [48] (https://github.com/alexwky/IBoost) or Zenodo [49] (https://doi.org/10.5281/zenodo.2529986). The simulated data sets and the codes to reproduce all the analyses presented in this paper are available on GitHub [50] (https://github.com/alexwky/IBoostPaper2019) and deposited on Zenodo [51] (https://doi.org/10.5281/zenodo.2532847).
Author information
Affiliations
Contributions
DYL and CMP coordinated the overall studies. KYW performed the statistical analyses. KYW, ABN, DZ, and DYL developed the statistical methodologies. CF and JSP coordinated the processing of the data. KYW, MT, DYL, and CMP wrote the paper, which all authors reviewed. All authors read and approved the final manuscript.
Corresponding authors
Correspondence to DanYu Lin or Charles M. Perou.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
CMP is an equity stock holder, consultant, and Board of Director Member of BioClassifier LLC and GeneCentric Diagnostics. CMP is also listed as an inventor on patents on the Breast PAM50 and Lung Cancer Subtyping assays. The other authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Fig. S1. The prognostic value of genomics data types. (PDF 38 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Cancer genomics
 Data integration
 Gene modules
 Variable selection