I-Boost: an integrative boosting approach for predicting survival time with multiple genomics platforms

We propose a statistical boosting method, termed I-Boost, to integrate multiple types of high-dimensional genomics data with clinical data for predicting survival time. I-Boost provides substantially higher prediction accuracy than existing methods. By applying I-Boost to The Cancer Genome Atlas, we show that the integration of multiple genomics platforms with clinical variables improves the prediction of survival time over the use of clinical variables alone; gene expression values are typically more prognostic of survival time than other genomics data types; and gene modules/signatures are at least as prognostic as the collection of individual gene expression data. Electronic supplementary material The online version of this article (10.1186/s13059-019-1640-4) contains supplementary material, which is available to authorized users.

integrative clustering of multiple genomics data types or platforms [6][7][8]. In addition, it is possible to identify genes that are related to patient survival time by decomposing the expression of each gene into a component that is explained by the methylation level and a component that is not [9].
One unsolved issue in cancer genomics is the prognostic value of integrated genomics and clinical data versus clinical data only. Yuan et al. [10] compared models with clinical data only versus models with both clinical and genomics data on various cancer types and concluded that genomics data provide only a limited gain in survival prediction accuracy. In their analysis, however, potential differences among data types were not taken into account. For breast cancer, for instance, the combination of genomics and clinical data has been shown to improve outcome predictions [11,12]. A major goal of the present work is to fully explore the predictive power of integrating clinical and genomics data together.
A second unsolved issue is the prognostic value of individual gene expression values (∼ 25,000) versus a predefined set of gene expression signatures or "modules" (∼ 500). Gene modules have been developed for representing distinct cell types (e.g., epithelial, immune, and endothelial), specific biological processes, or activated molecular signaling pathways. They have been shown to successfully capture signaling pathway activities or cell type heterogeneity within tumors. We wish to investigate whether individual gene expression data or existing gene modules provide more accurate outcome prediction.
A third unsolved issue is the relative importance of different types of genomics data in outcome prediction. Different data types are collected at different costs and also with widely varying feature spaces. Naturally, not all data types are equally important in outcome prediction. We aim to determine which data types may be omitted from analysis without a significant reduction in prediction accuracy.
An overarching methodological challenge in addressing the aforementioned issues is the identification of genomic variables predictive of survival time when the number of variables is much larger than the sample size. Penalized regression methods, such as least absolute shrinkage and selection operator (LASSO) [13] and elastic net [14], are commonly used to identify important genomic variables. When variables are highly correlated, elastic net tends to have better performance in prediction than LASSO [14]. However, both LASSO and elastic net are generic variable selection procedures that do not distinguish different types of data and thus tend to select more variables from the data types with larger numbers of variables. Because different data types capture different biological structures, both large and small data types may carry important signals. Methods that treat all variables equally may not be able to pick out independent signals from small data types. In addition, LASSO and elastic net impose the same penalty on all regression parameters, which may be overly restrictive because the number of variables and the signal strength vary drastically across data types.
Boosting is an alternative to penalization for model estimation and prediction in high-dimensional settings. It was originally developed for binary classification in machine learning [15,16]. The idea of boosting is to iteratively reweight the observations, with larger weights given to observations that are misclassified at the previous iteration, and apply simple classifiers on the reweighted data; their results are then combined to produce an aggregated classification procedure. Boosting was later generalized as a forward stagewise additive modeling method for statistical estimation [17,18], which can be applied to many problems, including regression analysis for survival data [19]. Because of its flexibility in modeling choices and stability in high-dimensional settings, boosting has found applications in genomics studies; see the references in Mayr et al. [20,21]. As in the case of LASSO and elastic net, however, existing boosting methods, such as component-wise boosting [22], do not distinguish variables of different data types.
To overcome the limitations of LASSO, elastic net, and existing boosting methods, we develop a novel method, termed Integrative Boosting (I-Boost), which combines elastic net with boosting. In I-Boost, the prediction rule is constructed iteratively, where at each iteration, the predictive power of each data type (conditional on the current prediction rule) is evaluated separately and the most predictive data type is selected to update the prediction rule using elastic net. Thus, independent signal from each data type can be incorporated into the prediction rule, and small but predictive data types will not be dominated by data types with large numbers of variables. In addition, the penalties on the regression parameters are learned data-adaptively and separately for different data types. Herein, we demonstrate the advantages of I-Boost using simulation studies and empirical data from the TCGA on patients with eight different cancer types. More importantly, we use I-Boost to address the aforementioned three unsolved issues in cancer genomics.

Results and discussion
Background Suppose that there are K types of clinical or genomics predictors, with d k components for the kth type (k = 1, . . . , K). For k = 1, . . . , K, let X (k) denote the d k -vector of predictors of the kth type. Write X = (X (1) , . . . , X (K) ) , where A denotes the transpose of A for any vector or matrix A. Let T denote the survival time of interest. We relate T to X through the proportional hazards model [23], such that the conditional hazard function of T given X takes the form of h 0 (t) exp(β X), where h 0 (t) is an arbitrary baseline hazard function, β = (β (1) , . . . , β (K) ) , and β (k) is a d k -vector of regression parameters associated with X (k) .
The survival time T is subject to right censoring by C, such that we observe Y ≡ min(T, C) and ≡ I(T ≤ C), where I(·) is the indicator function. For a study with n patients, the data consist of (Y i , i , X i ) (i = 1, . . . , n). The partial likelihood [24] for β is

LASSO and elastic net
Because X is high-dimensional, it is not feasible to estimate β by maximizing the partial likelihood. One possible remedy is to impose sparsity assumptions on β and adopt penalization methods, such as LASSO [13] and elastic net [14]. LASSO estimates β by maximizing the L 1 -penalized log-partial likelihood function where d = K k=1 d k , and λ is a tuning parameter. Elastic net generalizes LASSO by including an L 2 penalty, such that the objective function becomes where α ∈[0, 1] is a tuning parameter that controls the relative magnitudes of the L 1 and L 2 penalties. (When α = 1, elastic net reduces to LASSO.) The implementation of LASSO and elastic net is described in the "Methods" section.
For both LASSO and elastic net, the penalty term dominates under large values of λ, and the parameter estimates tend to be small with some values being exactly zero. Unlike LASSO, elastic net exhibits the grouping effect in that the regression parameters for a group of highly correlated variables tend to be equal, which is desirable in the context of gene selection [14]. Both LASSO and elastic net impose the same penalization on each regression parameter and thus do not distinguish different types of predictors. As a result, these methods may be inefficient when certain data types are much more predictive than others.

I-Boost
To account for the differential predictive power of different data types, we propose a boosting algorithm called I-Boost. Boosting is an iterative optimization algorithm that minimizes a loss function {Y, f (X )} over a class of functions of predictors f (X ), where Y = (Y 1 , . . . , Y n , 1 , . . . , n ), X = (X 1 , . . . , X n ), and {Y, f (X )} measures the deviation of the prediction f (X ) from the outcome Y. At each iteration, we update f (X ) additively by the value b(X ; β) up to a scaling factor, where b is a fixed basis function, and β is a vector of parameters. Specifically, at the mth iteration, we find (m) ) for some fixed step length factor v ∈ (0, 1]. We terminate the iterations when some stopping criterion is satisfied. In I-Boost, we set the loss function {Y, f (X )} to be the negative log-partial likelihood function and the basis function to be b(X ; β (m) is the vector of the kth type of predictors for the ith patient, and the data type k is selected dataadaptively. At each iteration, we search over all data types, select the one that yields the largest decrease in the loss function value at the current iteration, and update (a subset of ) the regression parameters corresponding to the selected data type; other parameters are fixed at their current estimated values. To handle high-dimensional data, we impose an elastic net penalty on β (m) in the optimization step. Effectively, we perform maximum penalized logpartial likelihood estimation with an offset term f m−1 (X ) using a single data type at each iteration. Unlike existing boosting methods, such as component-wise boosting, the basis function in our case is a function of all variables of a data type instead of a single variable. This choice of basis function is motivated by the expectations that some data types are much more predictive than others and that the inclusion of less predictive data types may reduce the prediction accuracy of the model. By considering each data type separately, we perform selection on the data-type level at each iteration.
We propose two versions of I-Boost, namely I-Boost-CV and I-Boost-Permutation, which use cross-validation and permutation, respectively, to choose the tuning parameters of elastic net at each iteration. The permutation procedure randomly permutes the outcome variables in order to remove association between the predictors and the outcome, and the tuning parameters are chosen such that no predictor is selected in half of the permuted data sets. The procedures are described in detail in the "Methods" section.

Simulation studies
We conducted simulation studies to evaluate the performance of LASSO, elastic net, and the two versions of I-Boost. We considered three simulation settings, with different distributions of signals across the data types. In all three settings, a relatively large proportion of the signals is contributed by the clinical variables. The distributions of signals are shown in Fig. 1, and the details of the simulation settings are provided in the "Methods" section.
We assessed the performance of the methods by the quality of prediction and parameter estimation. For prediction, we report the correlation between the estimated risk score K k=1 X (k) β (k) and the true risk score and β (k) 0 are the estimated and true parameter vectors, respectively. A higher correlation represents a greater degree of agreement between the predicted and actual outcomes. We call this measure the risk correlation. For parameter estimation, we report the mean-squared error (MSE), defined as K k=1     selected for each data type is also shown. I-Boost-CV always selects the largest number of variables, followed by elastic net, LASSO, and I-Boost-Permutation. I-Boost-CV selects a large number variables, because it iteratively performs elastic net, and the final model includes selected variables accumulated over all iterations. By contrast, I-Boost-Permutation, though iterative, performs LASSO (which generally selects fewer variables than elastic net) with the tuning parameter selected by the very conservative permutation method [25], so that it selects the least number of variables.
For estimation, the MSE under I-Boost-CV or I-Boost-Permutation is about 20-40% smaller than that under LASSO or elastic net in all settings. Decomposition of the MSE by data types reveals that the MSE for data types with very weak or no signal is small for I-Boost. This result shows that even though I-Boost-CV selects a relatively large number of variables from these data types, the variables generally have very small estimated regression parameters.
For prediction, the two I-Boost methods perform the best overall. In all settings, I-Boost-CV produces more accurate prediction than all other methods. In Settings 1 and 2, where most signals are concentrated on only one or two data types, I-Boost-Permutation produces more accurate prediction than both elastic net and LASSO. In Setting 3, I-Boost-Permutation performs similarly to elastic net, while LASSO performs worse than I-Boost-Permutation. Between the two versions of I-Boost, I-Boost-CV tends to yield better prediction than I-Boost-Permutation, possibly because of the larger number of variables selected by I-Boost-CV. Thus, if the main interest is the selection of relevant variables, then one might consider I-Boost-Permutation for more conservative variable selection, even though this method is somewhat inferior in prediction when compared to I-Boost-CV.
We implemented LASSO, elastic net, and the two versions of I-Boost using R-3.2.2 on a 2.93-GHz Xeon Linux computer. On average, performing LASSO, elastic net, I-Boost-Permutation, and I-Boost-CV on one simulated data set (that consists of 500 subjects, 6 data types, and 1294 predictors) takes about 2 min, 14 min, 3 h, and 38 h, respectively. I-Boost-CV is computationally intensive because in each iteration, cross-validation is conducted on a three-dimensional grid. By contrast, in each I-Boost-Permutation iteration, the tuning parameter α is fixed at 1, no cross-validation is involved in the selection of λ, and LASSO is performed only once for each data type. Therefore, I-Boost-Permutation may serve as a computationally efficient alternative to I-Boost-CV.

Evaluation of LASSO, elastic net, and I-Boost using TCGA data
We next evaluated the performance of the methods using three TCGA data sets, namely the lung adenocarcinoma (LUAD) data set, the kidney renal clear cell cancer (KIRC) data set, and a pan-cancer data set derived from ∼ 1400 patients that represents eight different tumor types considered by Hoadley et al. [26]; see the "Methods" section for a detailed description of the data sets and the evaluation procedure. For each data set, we first split the data 30 times into training and testing sets. We then performed LASSO, elastic net, and the two versions of I-Boost for various combinations of data types on patients from the training set of each split. For each combination of data types and each split, we calculated the risk scores for patients in the testing set using the estimates from the corresponding training set, and we used the concordance index (Cindex) [27] to evaluate the prediction accuracy of the risk scores.
The average C-index values over the splits obtained from LASSO and elastic net are given in Fig. 2. For the KIRC and pan-cancer data sets, the prediction tends to be much better than random (i.e., the C-index values are much larger than 0.5). For the LUAD data set, which has a small sample size, some of the models yield relatively poor prediction (with C-index values smaller than 0.6). For many models, the predictive performance of elastic net is either similar or superior to LASSO.
For LASSO and elastic net, the models containing more data types as predictors do not necessarily perform better than those with fewer data types. One possible explanation is that the extra data types may contain very little relevant information on patient survival, such that adding those data types introduces more noise than signal into the model. In practice, however, it is challenging to decide which data types to consider without prior knowledge of their importance. Figure 3 shows the average values of the C-index obtained from elastic net, I-Boost-CV, and I-Boost-Permutation for different models. For the LUAD, KIRC, and pan-cancer data sets, both versions of I-Boost provide better prediction than elastic net in almost all cases. The difference in prediction accuracy between I-Boost and elastic net is particularly large when the sample size is small and the number of predictors is large. The difference is likely due to the fact that I-Boost involves the selection of data types, so that the large and non-predictive data types would not be selected in most iterations, and their presence would not substantially worsen the prediction accuracy. For the KIRC and pan-cancer data sets, I-Boost-CV yields better prediction than I-Boost-Permutation, whereas for LUAD, there are no clear differences between the two methods.

Prognostic value of integrated clinical and genomics data
To assess whether the genomic variables provide extra predictive power in the presence of the clinical variables, we computed the net reclassification improvement (NRI) [28,29] values between the models with both clinical and genomic variables (estimated by I-Boost-CV or I-Boost-Permutation) and the model with clinical variables only (estimated by maximum partial likelihood estimation). The NRI compares a model of interest with a baseline model and measures how much a subject's predicted risk under the model of interest, relative to that under the baseline model, aligns with the subject's survival Fig. 3 Analysis results for the TCGA LUAD, KIRC, and pan-cancer data sets using elastic net, I-Boost-CV, and I-Boost-Permutation. Each row represents a particular combination of data types used as predictors, as indicated by the box on the left. Each dot is an average C-index value obtained by performing elastic net, I-Boost-CV, or I-Boost-Permutation on 30 training and testing data set pairs. See the caption of Fig. 1 for the abbreviations of the data types time. For instance, an NRI of 0.2 means that by switching from the baseline model to the model of interest, the proportion of high-risk subjects being reassigned a larger predicted risk is on average larger, by a value of 0.2, than the proportion of low-risk subjects being so reassigned; here, high-risk or low-risk subjects refer, respectively, to those with survival times shorter or longer than a fixed threshold, which we set to be 3 years throughout the paper. (See the "Methods" section for a theoretical definition of the NRI.) The average NRI values over data splits are shown in Fig. 4. The patterns of the results from I-Boost-CV and I-Boost-Permutation are similar. For the KIRC and pancancer data sets, the majority of the models that contain both clinical and genomic variables yield positive NRI, which implies that they provide better prediction than the model with clinical variables only. Most NRI values under I-Boost-CV are close to 0.2; in biomarker studies, an NRI of 0.2 is considered an intermediate-level improvement [30]. For the LUAD data set, only a few models that contain both clinical and genomic variables provide better prediction than the model with clinical variables only. These results indicate that in certain cancer types, genomic variables contribute to survival prediction in the presence of clinical variables, and the magnitude of the contribution can be large. When the same comparisons are made using LASSO or elastic net, however, the inclusion of genomic variables in the models does not appreciably improve prediction.

Evaluation of gene expression modules
To compare the performance of gene modules versus individual gene expression data, we calculated the NRI values between models with each type of gene expression data separately. Specifically, for each combination of data types other than individual gene expression data and gene modules, we computed the NRI between the model with those data types and gene modules (estimated by I-Boost-CV or I-Boost-Permutation) and that with those data types and individual gene expression data. The NRI values are shown in Fig. 5. Under both methods, the use of gene modules leads to substantially better prediction than the use of expression data of all individual genes for the LUAD data set. For the KIRC and pan-cancer data sets, the performance of the two types of gene expression data is similar, and there is no strong evidence favoring gene modules or individual gene expression data on the basis of prediction accuracy. Nevertheless, because gene modules are smaller in number and much easier to interpret, we generally recommend the use of gene modules over individual gene expression data.

Comparison among genomics data types
To evaluate the relative prognostic value of each genomics data type, we formed a series of nested models as follows. We began by setting the model with clinical variables only as the first member of the series of models. At each later step, we computed the NRI between each model that contains all currently included data types and an extra genomics data type and the model included at the previous step. The model that yielded the largest NRI was set to be the next member of the series of models. The process was repeated until all data types were included. (Individual gene expression data were not considered in this analysis.) At each step of the process, the data type NRI values between models containing individual gene expression data and models containing gene modules under the TCGA LUAD, KIRC, and pan-cancer data sets. Each dot represents the average NRI obtained by fitting I-Boost-CV or I-Boost-Permutation on two sets of predictors over 30 training and testing data set pairs. The first set of predictors contains a combination of data types and gene modules; the second set of predictors contains the same combination of data types and individual gene expression data. A positive NRI represents better prediction using the model with gene modules that yielded the largest improvement in predictive power (over the data types already included) was selected, so that more predictive data types tend to be included earlier, and the order in which the data types entered the models reflects their relative importance. We performed this procedure for elastic net and the two versions of I-Boost. For the LUAD, KIRC, and pan-cancer data sets, the NRI values and their 95% confidence intervals for the series of models are plotted in Fig. 6, and the data type selected at each step is shown. We also plotted the C-index against the number of variables selected for each model.
Because different methods vary in their abilities to extract useful information from given data types, the orders of data types determined by the methods are generally different. For the LUAD, KIRC, and pan-cancer data sets, the NRI under I-Boost-CV or I-Boost-Permutation tends to be positive or around zero with the inclusion of each new data type. This indicates that I-Boost extracts useful information from each additional data type and that its performance tends not to be worsened by the inclusion of additional variables.
I-Boost-Permutation always selects the smallest number of variables, followed by elastic net and I-Boost-CV. This finding is consistent with the conclusions from the simulation studies. Because the C-index obtained by I-Boost-Permutation is higher in most cases than that obtained by elastic net, we conclude that I-Boost-Permutation provides the same or better prediction using fewer variables than elastic net.
For the LUAD and pan-cancer data sets, gene modules are the first genomics data type selected under both versions of I-Boost, and the inclusion of gene modules leads to considerable improvement in prediction accuracy. For the KIRC data set, miRNA expression data are first selected by I-Boost-CV, while gene modules are first selected by I-Boost-Permutation. For I-Boost-CV, however, the model with clinical variables and gene modules yields an NRI of 0.19, which represents a substantial improvement over the model with clinical variables only. The confidence intervals of the NRI include zero due to the small sample sizes of the testing data sets. Nevertheless, the pattern of consistent positive NRI values shown in Fig. 4 and the fact that the NRI values are averages over 30 data splits suggest that the improvement in prediction accuracy is robust. For both versions of I-Boost, after the inclusion of the first genomics data type, the improvement in prediction accuracy with the inclusion of additional data types is marginal. We conclude that gene modules are overall the most predictive genomics data type, and the remaining genomics data types tend not to provide extra predictive power beyond clinical variables and gene modules.
We also evaluated the prognostic value of genomics data in the absence of clinical data. The average C-index values for combinations of genomics data types over 30 training and testing data splits for the LUAD, KIRC, and pancancer data sets are given in Additional file 1: Fig. S1. The maximum C-index values obtained using genomics data types alone are 0.64, 0.72, and 0.74 in the LUAD, KIRC, and pan-cancer data sets, respectively; they are substantially smaller than the corresponding maximum values obtained using both clinical and genomics data. For the LUAD data set, miRNA expression data alone yield the Fig. 6 Analysis results for the TCGA LUAD, KIRC, and pan-cancer data sets, using elastic net, I-Boost-CV, and I-Boost-Permutation on nested models. In the left panel, the leftmost dots are fixed at zero, and each remaining dot represents the average NRI obtained by fitting elastic net, I-Boost-CV, or I-Boost-Permutation over 30 training and testing data set pairs. Each dot except the leftmost dots represents the maximum NRI between a model that contains one more data type than the model corresponding to the dot on the left and the model corresponding to the dot on the left. Above each dot, the name of the additional data type is included. In the right panel, the average C-index values and the average numbers of selected variables for the models shown in the left panel are plotted. The arrows indicate the orders of models with respect to the number of data types they contain. See the caption of Fig. 1 for the abbreviations of the data types largest C-index, whereas for the KIRC data set, the combination of miRNA expression and protein expression data yield the largest C-index. For the pan-cancer data set, the C-index values for combinations of genomics data types with individual gene expression data are almost identical and are larger than those obtained without individual gene expression data.

Important predictors for the LUAD, KIRC, and pan-cancer data sets
To obtain the final models of important predictors, we performed I-Boost-Permutation on the LUAD, KIRC, and pan-cancer data sets. The final models are shown in Tables 1, 2, and 3 for the LUAD, KIRC, and pancancer data sets, respectively. The predictors that are also selected by LASSO, elastic net, and I-Boost-CV are marked.
Age and pathological nodal status are negatively associated with survival time in the LUAD, KIRC, and pancancer data sets. Age has been reported to be prognostic for many cancer types [31][32][33]. In the analysis of the pan-cancer data set, cancer types were selected, which is logical, since the survival time is known to depend on cancer types [26]. Thus, the tissue of origin remains an important prognostic factor. Among the gene modules, Glycolysis_signature and MUnknown_24 are negatively associated with survival time in the LUAD and pancancer data sets; these two modules are correlated with Hypoxia signatures among a set of 1198 TCGA breast cancer patients. Likewise, Pcorr_IGS_Correlation and Activate.Endothelium, which are negatively associated with survival time in the pan-cancer data set, are correlated with proliferation signatures; the latter are known to be negatively associated with survival time. Note: "Estimate" is the estimate of the log hazard ratio under the Cox proportional hazards model, where a positive value represents an increase of the hazard. The predictors are standardized to have unit standard deviation. Gender is coded as female = 0 and male = 1; pathologic stage T is dichotomized into T1 (0) and T2-T4 (1); pathologic stage N is dichotomized into N0 (0) and N1-N3 (1). Predictors that are also selected by LASSO, elastic net, and I-Boost-CV are marked with an asterisk (*) In contrast, signatures of CD8 T cells, non-inflammatory breast cancer (nIBC and MM_Red2), and luminal features (Mature_LuminalUp, GP7_estrogen signaling, HS_ Green1, HS_Green8, LUMINAL_Cluster, Duke_Module06_er, Pcorr_Dasatinib_L_Correlation, and HS_Green18) are positively associated with survival time in the KIRC or pan-cancer data sets. The NEU_cluster module is positively associated with survival time in the LUAD data set, which is biologically significant because this module represents epithelial luminal cell differentiation and thus tracks more differentiated and lower grade lung cancers. The selected features, many of which are also selected by other variable selection methods, have significant biological implications and demonstrate the robustness of the I-Boost methodology.

Conclusions
In this paper, we present a novel method, termed I-Boost, for variable selection and outcome prediction that is especially powerful when one wishes to simultaneously consider multiple genomics and/or proteomics data types. We used simulation studies and real data to demonstrate that in the presence of multiple data types with diverse signal strength, I-Boost produces better outcome prediction than LASSO and elastic net. We proposed two versions of I-Boost, namely I-Boost-CV and I-Boost-Permutation. I-Boost-CV yields  Table 1 more accurate prediction than I-Boost-Permutation, but it generally selects many more variables and is computationally more intensive. By contrast, I-Boost-Permutation is computationally efficient and selects much fewer variables, which may be preferable for follow-up experiments. Consistent with the current literature, we found that clinical variables are strong predictors of survival time. With I-Boost, we were able to build upon the clinical variables and extract additional useful information from genomic variables in order to improve the prediction; the improvement that we obtained with I-Boost was considerably larger than that obtained by either LASSO or elastic net. We also compared the use of individual gene expression data versus gene modules and found that the use of gene modules leads to improvement in prediction accuracy and more interpretable results. When we considered the selected I-Boost models, clinical variables (e.g., age, tumor size, and pathological nodal status) were strong predictors of survival. The I-Boost methods also selected several gene modules that were previously identified as prognostic of outcomes, whether positive or negative.
Our study has limitations. The main limitation is that the LUAD and KIRC data sets pertain to a relatively small number of patients, with an even smaller number of observed events. This limitation motivated us to combine eight solid epithelial tumor types to form a large pan-cancer data set. The analyses on the pan-cancer data might not properly account for heterogeneity across different cancer types. Another limitation of our study is that the quality of the clinical data varies across different cancer types; for example, the follow-up time for some cancer types was quite short.
In summary, we demonstrated that the performance of I-Boost is superior to that of elastic net and LASSO and that the performance of gene modules is superior to that of the totality of individual genes. The I-Boost methodology is applicable to any disease states where multiple types of genomics and/or proteomics data are available and thus has potential applications beyond cancer studies.

Data description
TCGA provides a large open-access database that includes clinical and genomics data for patients with 33 cancer types or subtypes. Herein, we focused on eight cancer types or subtypes, namely, LUAD, KIRC, colon adenocarcinoma (COAD), rectal adenocarcinoma (READ), lung squamous cell carcinoma (LUSC), bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), and head and neck squamous cell carcinoma (HNSC). For clinical variables, somatic mutation, copy number variation, mRNA expression, and miRNA expression, data on 2272 patients were obtained from the December 22, 2012, Pan-Cancer-12 data freeze from the Sage Bionetworks Repository Synapse [34]; the data were previously processed and described by Hoadley et al. [26]. Protein expression data were downloaded from Broad GDAC Firehose [35] for a subset of 1779 patients included in the data set of Hoadley et al. [26].
Clinical variables included gender, age, pathological stages T and N, and cancer type. In all analyses, COAD and READ were considered as one cancer type. For mRNA expression data, we used RNA-seq by Expectation-Maximization (RSEM) [36] to quantify the transcript abundances measured by RNA sequencing and used the log2-transformed up-quantile-normalized RSEM values of 12,434 genes. The RNA sequencing was performed at the University of North Carolina at Chapel Hill [37][38][39]. Gene level expression data are also available on the Broad GDAC Firehose [35]. For mutation data, we used the single nucleotide variant calls, which were de-duplicated and re-annotated using the Ensembl version 69 transcript database. A total of 130 genes with non-synonymous mutations in more than 10% of the whole sample were included for the analyses. The combined mutation annotation format file is available from the Synapse resource. For miRNA expression data, we used the read count data for 305 normalized expressions, which were compiled into an abundance matrix for 5p and 3p mature miRBase strands [37]. For reversephase protein arrays, we used the level 3 normalized data for 136 proteins or phospho-proteins. For copy number data, SNP6.0 array-based gene-level somatic copy number alteration data were generated from the GISTIC analysis [40]. The input data matrix is available in Synapse at syn1710678. We used the copy number values for 216 cancer-specific segments, which are frequently altered in cancer of various types including breast cancer, and segments for all chromosome arms (a total of 41 segments) [41,42].
We defined gene modules as sets of co-expressed genes that are considered to be functional units in breast cancer. We built a collection of 497 gene modules. The modules were constructed on the basis of 73 publications or results from the Gene Set Enrichment Analysis [43]. A partial list of the modules appears in Fan et al. [12]. Among the modules, 461 are median expression values for homogeneously expressed genes, 33 are correlations of expression values with predetermined gene centroids, and 3 are built from previously published gene expression prognostic models.
After removing patients with missing data, the total sample size was 1420, including 202 LUAD patients and 195 KIRC patients. All survival times were censored at 5 years if the patients were still in the study at that time point. For the pan-cancer data set, the median followup time was 16.8 months, and the censoring rate was 77.6%. For the subset of LUAD patients, the median follow-up time was 13.9 months, and the censoring rate was 71.3%. For the subset of KIRC patients, the median follow-up time was 28.9 months, and the censoring rate was 63.6%.

LASSO and elastic net
We implemented LASSO and elastic net using the Rpackage "glmnet" [44] and used fivefold cross-validation to select the tuning parameters. For elastic net, crossvalidation was performed over a two-dimensional grid of (α, λ), while for LASSO, α was set to be 1. For elastic net, the grid for α was chosen to be (0.05, 0.1, 0.2, . . . , 1.0), and a grid for λ was chosen separately for each α using the default settings of glmnet. (A minimum value of 0.05 was considered for α, because α too close to 0 may result in too many variables being selected; in particular, no sparsity is imposed if α = 0.) To make the selection procedure more stable, we repeated the split and evaluation procedure five times, and the cross-validation errors were averaged over the five repetitions.

I-Boost
The I-Boost algorithm is given as follows: using the coordinate-descent algorithm [44], where is the partial likelihood with offset term f and covariates X (k) , α m and λ m are tuning parameters, f = (f 1 , . . . , f n ) , and  (f m,1 , . . . , f m,n ) .
At the mth iteration, only the regression parameters corresponding to the k m th data type are updated. We refer to the d-vector with value β (m) at the positions corresponding to the k m th data type and zero elsewhere as the current estimate at the mth iteration. The current estimate at each iteration contributes to the final parameter estimate additively, and the final parameter estimate is simply the sum of the current estimates obtained from all steps multiplied by v.
I-Boost-CV and I-Boost-Permutation use cross-validation and permutation, respectively, to choose (k m , α m , λ m ) at step 2(a). For I-Boost-CV, we adopt fivefold crossvalidation separately at each iteration over a threedimensional grid on {1, . . . , is a value large enough to shrink the current estimate to zero.
For I-Boost-Permutation, we first perform LASSO separately for each data type X (k) (k = 1, . . . , K) with tuning parameter λ (k) m , where λ (k) m is selected using the permutation method proposed by Sabourin et al. [25]; the permutation method is only applicable to LASSO. The procedure is motivated by the principle that in a null model, i.e., in the absence of any relevant predictors, the tuning parameters should be chosen such that no variable is selected. The permutation selection procedure first generates hypothetical null models by randomly permuting (Y i , i , f m−1,i ) B times at each iteration, so that in each permuted data set, the association between the predictors and the outcome (and the offset term) is removed. The procedure then finds the smallest λ such that no variable is selected for each permuted data set and selects the median of the B values of λ. For the kth data type (k = 1, . . . , K), let λ (k) m be the selected tuning parameter and β (k) m be the corresponding LASSO estimate. We select k m based on the partial-likelihood value at β (k) m , i.e., k m = argmax k L (k) (f m−1 ; β (k) m ), and set α m = 1 and Empirical studies suggested that a small value of the step length factor v often improves and almost never worsens the performance of boosting [45]. Therefore, it is recommended that v is chosen to be as small as possible while the algorithm remains computationally feasible. In the settings we have considered, the performance of I-Boost is not sensitive to v within the range of v ∈ (0.05, 0.5). Therefore, we set v to a moderately small value of 0.1.
Conventional boosting methods require a stopping criterion to avoid over-fitting. In our experience, however, because the tuning parameters are selected separately at each iteration for I-Boost, they eventually lead to shrinkage of all (current) parameter estimates. Therefore, we do not adopt a separate procedure to determine the stopping time of the iteration. We terminate the iteration when f m remains constant for five consecutive iterations.

Simulation studies
In the simulation studies, we considered all data types except individual gene expression data. For each simulation data set, we generated the predictors by sampling without replacement whole vectors of predictors from the TCGA pan-cancer data set. We generated the survival time from a proportional hazards model with the baseline hazard function h 0 (t) = t and generated the censoring time from an exponential distribution with a mean chosen to result in censoring proportion of about 50%. We set the sample size n to 500 in all settings.
The regression parameters were chosen to produce a different proportion of signals across data types, where the signal of data type k is defined to be Var(X (k) β (k) 0 ), and the predictors were standardized. The variables with non-zero regression parameters, hereafter referred to as signal variables, were chosen to be weakly correlated. We considered three settings, with the distributions of signals and number of signal variables shown at the bottom of Fig. 1. In all settings, the signals of all data types sum to 1.2, and the regression parameters of signal variables of the same data type are equal; based on simulation studies not presented, the relative performance of different methods is very similar under different values of total signal. In Setting 1, the clinical variables contain much stronger signals than the other data types. Mutation and copy number variation data do not contain any signal. In Setting 2, all signals are concentrated on the clinical variables and gene modules, and the two data types equally share the signals. In Setting 3, the clinical variables contain the most signals, and the remaining signals are evenly distributed across the other data types.
Because we considered a total of six data types, I-Boost-CV is computationally demanding. To lessen the computational burden, we set v = 0.2 instead of the value 0.1 used in real data analysis.

Assessment of prediction
To assess an analysis method, we split the data into 30 training and testing sets with a 3:2 ratio of sample sizes. We used the R-package "sampling" [46] to perform the data split, such that the distributions of the clinical variables in the training and testing sets are approximately equal. We performed the analysis on the training sets, and the results were assessed on the corresponding testing sets using the C-index. For each split of the data, we repeated this estimation-validation procedure on different combinations of data types as predictors. We only consider combinations of data types that include clinical variables, because clinical variables are almost always considered in practice, and one of the main objectives of this paper is to evaluate the prognostic value of the combination of genomics and clinical data. The analyses were conducted on the 30 splits of the data and on the 48 combinations of data types for the LUAD, KIRC, and pan-cancer data sets.
To quantify the prediction accuracy, we used the Cindex. Let T i be the survival time and X i be a vector of predictors for the ith subject, and let β be a vector of regression parameters. The risk score is defined as X i β. If T i and X i β are continuous, then the C-index is defined as P(X i β > X j β | T i < T j ). The C-index is the probability that for a random pair of subjects in which the first subject has a shorter survival time, the risk score for the first subject is higher. Thus, C-index measures how well the risk score aligns with the actual survival time. For each pair of training and testing sets, we set β to be the parameter estimate obtained from the training set and estimated the C-index for the testing set using the method of Pencina and D' Agostino [27]. If no variable was selected, then a C-index value of 0.5 was assigned.
We used the NRI to compare the prediction accuracy of a model of interest and a baseline model. Let T be the survival time, X andX be vectors of predictors for the model of interest and the baseline model, respectively, and β andβ be the corresponding vectors of regression parameters. Let q andq be the (estimated) survival probabilities at a fixed time point t 0 given univariate covariates X β andX β , respectively, where t 0 is a survival-time threshold, such that subjects with T < t 0 are considered high risk. The NRI is defined as P (q <q | T < t 0 ) − P (q <q | T > t 0 ). A large NRI means that by switching from the baseline model to the model of interest, the direction of change of the predicted risk aligns with the actual survival time for a large proportion of subjects. To compute the NRI between two models using a pair of training and testing sets, we set (β,β) to be the parameter estimates obtained from the training set and calculated (q,q) on the testing set. We estimated the NRI and its confidence interval on the testing set using the method of Uno et al. [29]. The reported NRI values and confidence limits are the average values over 30 training and testing data splits. Note that this NRI is one half the value of the NRI(> 0) defined in Pencina et al. [28,30].

Additional file
Additional file 1: Fig. S1. The prognostic value of genomics data types. (PDF 38 kb)