 Method
 Open Access
 Published:
IBoost: an integrative boosting approach for predicting survival time with multiple genomics platforms
Genome Biologyvolume 20, Article number: 52 (2019)
Abstract
We propose a statistical boosting method, termed IBoost, to integrate multiple types of highdimensional genomics data with clinical data for predicting survival time. IBoost provides substantially higher prediction accuracy than existing methods. By applying IBoost to The Cancer Genome Atlas, we show that the integration of multiple genomics platforms with clinical variables improves the prediction of survival time over the use of clinical variables alone; gene expression values are typically more prognostic of survival time than other genomics data types; and gene modules/signatures are at least as prognostic as the collection of individual gene expression data.
Background
Prediction of disease outcomes, such as individual patient survival time, is critically important for cancer patients. Traditional prognostic models that rely solely on clinical variables, such as age and tumor stage, fail to account for the molecular heterogeneity of tumors and thus may lead to suboptimal treatment decisions [1]. To remedy this situation, many studies have incorporated gene expression data in survival prediction [2–5].
Largescale genomics projects such as The Cancer Genome Atlas (TCGA) have generated detailed molecular data on patients with a variety of cancer types. In TCGA, six types of “omics” data have been collected on the same set of patients: DNA copy number variation, somatic mutation, mRNA expression, microRNA expression, DNA methylation, and expression of ∼ 200 proteins/phosphoproteins. The availability of multiple data types has enabled researchers to address a variety of important questions. For example, patients can be more precisely classified into molecular subtypes based on integrative clustering of multiple genomics data types or platforms [6–8]. In addition, it is possible to identify genes that are related to patient survival time by decomposing the expression of each gene into a component that is explained by the methylation level and a component that is not [9].
One unsolved issue in cancer genomics is the prognostic value of integrated genomics and clinical data versus clinical data only. Yuan et al. [10] compared models with clinical data only versus models with both clinical and genomics data on various cancer types and concluded that genomics data provide only a limited gain in survival prediction accuracy. In their analysis, however, potential differences among data types were not taken into account. For breast cancer, for instance, the combination of genomics and clinical data has been shown to improve outcome predictions [11, 12]. A major goal of the present work is to fully explore the predictive power of integrating clinical and genomics data together.
A second unsolved issue is the prognostic value of individual gene expression values (∼ 25,000) versus a predefined set of gene expression signatures or “modules” (∼ 500). Gene modules have been developed for representing distinct cell types (e.g., epithelial, immune, and endothelial), specific biological processes, or activated molecular signaling pathways. They have been shown to successfully capture signaling pathway activities or cell type heterogeneity within tumors. We wish to investigate whether individual gene expression data or existing gene modules provide more accurate outcome prediction.
A third unsolved issue is the relative importance of different types of genomics data in outcome prediction. Different data types are collected at different costs and also with widely varying feature spaces. Naturally, not all data types are equally important in outcome prediction. We aim to determine which data types may be omitted from analysis without a significant reduction in prediction accuracy.
An overarching methodological challenge in addressing the aforementioned issues is the identification of genomic variables predictive of survival time when the number of variables is much larger than the sample size. Penalized regression methods, such as least absolute shrinkage and selection operator (LASSO) [13] and elastic net [14], are commonly used to identify important genomic variables. When variables are highly correlated, elastic net tends to have better performance in prediction than LASSO [14]. However, both LASSO and elastic net are generic variable selection procedures that do not distinguish different types of data and thus tend to select more variables from the data types with larger numbers of variables. Because different data types capture different biological structures, both large and small data types may carry important signals. Methods that treat all variables equally may not be able to pick out independent signals from small data types. In addition, LASSO and elastic net impose the same penalty on all regression parameters, which may be overly restrictive because the number of variables and the signal strength vary drastically across data types.
Boosting is an alternative to penalization for model estimation and prediction in highdimensional settings. It was originally developed for binary classification in machine learning [15, 16]. The idea of boosting is to iteratively reweight the observations, with larger weights given to observations that are misclassified at the previous iteration, and apply simple classifiers on the reweighted data; their results are then combined to produce an aggregated classification procedure. Boosting was later generalized as a forward stagewise additive modeling method for statistical estimation [17, 18], which can be applied to many problems, including regression analysis for survival data [19]. Because of its flexibility in modeling choices and stability in highdimensional settings, boosting has found applications in genomics studies; see the references in Mayr et al. [20, 21]. As in the case of LASSO and elastic net, however, existing boosting methods, such as componentwise boosting [22], do not distinguish variables of different data types.
To overcome the limitations of LASSO, elastic net, and existing boosting methods, we develop a novel method, termed Integrative Boosting (IBoost), which combines elastic net with boosting. In IBoost, the prediction rule is constructed iteratively, where at each iteration, the predictive power of each data type (conditional on the current prediction rule) is evaluated separately and the most predictive data type is selected to update the prediction rule using elastic net. Thus, independent signal from each data type can be incorporated into the prediction rule, and small but predictive data types will not be dominated by data types with large numbers of variables. In addition, the penalties on the regression parameters are learned dataadaptively and separately for different data types. Herein, we demonstrate the advantages of IBoost using simulation studies and empirical data from the TCGA on patients with eight different cancer types. More importantly, we use IBoost to address the aforementioned three unsolved issues in cancer genomics.
Results and discussion
Background
Suppose that there are K types of clinical or genomics predictors, with d_{k} components for the kth type (k=1,…,K). For k=1,…,K, let X^{(k)} denote the d_{k}vector of predictors of the kth type. Write X=(X^{(1)′},…,X^{(K)′})^{′}, where A^{′} denotes the transpose of A for any vector or matrix A. Let T denote the survival time of interest. We relate T to X through the proportional hazards model [23], such that the conditional hazard function of T given X takes the form of h_{0}(t) exp(β^{′}X), where h_{0}(t) is an arbitrary baseline hazard function, β=(β^{(1)′},…,β^{(K)′})^{′}, and β^{(k)} is a d_{k}vector of regression parameters associated with X^{(k)}.
The survival time T is subject to right censoring by C, such that we observe Y≡ min(T,C) and Δ≡I(T≤C), where I(·) is the indicator function. For a study with n patients, the data consist of (Y_{i},Δ_{i},X_{i})(i=1,…,n). The partial likelihood [24] for β is
LASSO and elastic net
Because X is highdimensional, it is not feasible to estimate β by maximizing the partial likelihood. One possible remedy is to impose sparsity assumptions on β and adopt penalization methods, such as LASSO [13] and elastic net [14]. LASSO estimates β by maximizing the L_{1}penalized logpartial likelihood function
where \(d=\sum _{k=1}^{K} d_{k}\), and λ is a tuning parameter. Elastic net generalizes LASSO by including an L_{2} penalty, such that the objective function becomes
where α∈[ 0,1] is a tuning parameter that controls the relative magnitudes of the L_{1} and L_{2} penalties. (When α=1, elastic net reduces to LASSO.) The implementation of LASSO and elastic net is described in the “Methods” section.
For both LASSO and elastic net, the penalty term dominates under large values of λ, and the parameter estimates tend to be small with some values being exactly zero. Unlike LASSO, elastic net exhibits the grouping effect in that the regression parameters for a group of highly correlated variables tend to be equal, which is desirable in the context of gene selection [14]. Both LASSO and elastic net impose the same penalization on each regression parameter and thus do not distinguish different types of predictors. As a result, these methods may be inefficient when certain data types are much more predictive than others.
IBoost
To account for the differential predictive power of different data types, we propose a boosting algorithm called IBoost. Boosting is an iterative optimization algorithm that minimizes a loss function \(\ell \{\mathcal {Y},\boldsymbol {f}(\mathcal {X})\}\) over a class of functions of predictors \(\boldsymbol {f}(\mathcal {X})\), where \(\mathcal {Y}=(Y_{1},\ldots,Y_{n},\Delta _{1},\ldots,\Delta _{n})\), \(\mathcal {X}=(\boldsymbol {X}_{1},\ldots,\boldsymbol {X}_{n})\), and \(\ell \{\mathcal {Y},\boldsymbol {f}(\mathcal {X})\}\) measures the deviation of the prediction \(\boldsymbol {f}(\mathcal {X})\) from the outcome \(\mathcal {Y}\). At each iteration, we update \(\boldsymbol {f}(\mathcal {X})\) additively by the value \(\boldsymbol {b}(\mathcal {X};\boldsymbol {\beta })\) up to a scaling factor, where b is a fixed basis function, and β is a vector of parameters. Specifically, at the mth iteration, we find β^{(m)} that minimizes \(\ell \{\mathcal {Y},\boldsymbol {f}_{m1}(\mathcal {X}) + \boldsymbol {b}(\mathcal {X};\boldsymbol {\beta }^{(m)})\}\), possibly under some constraints on β^{(m)}, where f_{m−1} is the estimate of f at the (m−1)th iteration. Then, we set \(\boldsymbol {f}_{m}(\mathcal {X})=\boldsymbol {f}_{m1}(\mathcal {X}) + v \boldsymbol {b}(\mathcal {X};\boldsymbol {\beta }^{(m)})\) for some fixed step length factor v∈(0,1]. We terminate the iterations when some stopping criterion is satisfied.
In IBoost, we set the loss function \(\ell \{\mathcal {Y},\boldsymbol {f}(\mathcal {X})\}\) to be the negative logpartial likelihood function and the basis function to be \(\boldsymbol {b}(\mathcal {X};\boldsymbol {\beta }^{(m)})=(\boldsymbol {X}_{1}^{(k)\prime }\boldsymbol {\beta }^{(m)},\ldots,\boldsymbol {X}_{n}^{(k)\prime }\boldsymbol {\beta }^{(m)})^{\prime }\), where \(\boldsymbol {X}^{(k)}_{i}\) is the vector of the kth type of predictors for the ith patient, and the data type k is selected dataadaptively. At each iteration, we search over all data types, select the one that yields the largest decrease in the loss function value at the current iteration, and update (a subset of) the regression parameters corresponding to the selected data type; other parameters are fixed at their current estimated values. To handle highdimensional data, we impose an elastic net penalty on β^{(m)} in the optimization step. Effectively, we perform maximum penalized logpartial likelihood estimation with an offset term \(\boldsymbol {f}_{m1}(\mathcal {X})\) using a single data type at each iteration. Unlike existing boosting methods, such as componentwise boosting, the basis function in our case is a function of all variables of a data type instead of a single variable. This choice of basis function is motivated by the expectations that some data types are much more predictive than others and that the inclusion of less predictive data types may reduce the prediction accuracy of the model. By considering each data type separately, we perform selection on the datatype level at each iteration.
We propose two versions of IBoost, namely IBoostCV and IBoostPermutation, which use crossvalidation and permutation, respectively, to choose the tuning parameters of elastic net at each iteration. The permutation procedure randomly permutes the outcome variables in order to remove association between the predictors and the outcome, and the tuning parameters are chosen such that no predictor is selected in half of the permuted data sets. The procedures are described in detail in the “Methods” section.
Simulation studies
We conducted simulation studies to evaluate the performance of LASSO, elastic net, and the two versions of IBoost. We considered three simulation settings, with different distributions of signals across the data types. In all three settings, a relatively large proportion of the signals is contributed by the clinical variables. The distributions of signals are shown in Fig. 1, and the details of the simulation settings are provided in the “Methods” section.
We assessed the performance of the methods by the quality of prediction and parameter estimation. For prediction, we report the correlation between the estimated risk score \(\sum _{k=1}^{K} \boldsymbol {X}^{(k)\prime } \hat {\boldsymbol {\beta }}^{(k)}\) and the true risk score \(\sum _{k=1}^{K} \boldsymbol {X}^{(k)\prime } \boldsymbol {\beta }^{(k)}_{0}\), where \(\hat {\boldsymbol {\beta }}^{(k)}\) and \(\boldsymbol {\beta }^{(k)}_{0}\) are the estimated and true parameter vectors, respectively. A higher correlation represents a greater degree of agreement between the predicted and actual outcomes. We call this measure the risk correlation. For parameter estimation, we report the meansquared error (MSE), defined as \(\sum _{k=1}^{K}\ \hat {\boldsymbol {\beta }}^{(k)}  \boldsymbol {\beta }^{(k)}_{0}\_{2}^{2}\).
Figure 1 shows the risk correlation and MSE for elastic net, LASSO, and the two versions of IBoost based on 1000 replications; the average number of variables selected for each data type is also shown. IBoostCV always selects the largest number of variables, followed by elastic net, LASSO, and IBoostPermutation. IBoostCV selects a large number variables, because it iteratively performs elastic net, and the final model includes selected variables accumulated over all iterations. By contrast, IBoostPermutation, though iterative, performs LASSO (which generally selects fewer variables than elastic net) with the tuning parameter selected by the very conservative permutation method [25], so that it selects the least number of variables.
For estimation, the MSE under IBoostCV or IBoostPermutation is about 20–40% smaller than that under LASSO or elastic net in all settings. Decomposition of the MSE by data types reveals that the MSE for data types with very weak or no signal is small for IBoost. This result shows that even though IBoostCV selects a relatively large number of variables from these data types, the variables generally have very small estimated regression parameters.
For prediction, the two IBoost methods perform the best overall. In all settings, IBoostCV produces more accurate prediction than all other methods. In Settings 1 and 2, where most signals are concentrated on only one or two data types, IBoostPermutation produces more accurate prediction than both elastic net and LASSO. In Setting 3, IBoostPermutation performs similarly to elastic net, while LASSO performs worse than IBoostPermutation. Between the two versions of IBoost, IBoostCV tends to yield better prediction than IBoostPermutation, possibly because of the larger number of variables selected by IBoostCV. Thus, if the main interest is the selection of relevant variables, then one might consider IBoostPermutation for more conservative variable selection, even though this method is somewhat inferior in prediction when compared to IBoostCV.
We implemented LASSO, elastic net, and the two versions of IBoost using R3.2.2 on a 2.93GHz Xeon Linux computer. On average, performing LASSO, elastic net, IBoostPermutation, and IBoostCV on one simulated data set (that consists of 500 subjects, 6 data types, and 1294 predictors) takes about 2 min, 14 min, 3 h, and 38 h, respectively. IBoostCV is computationally intensive because in each iteration, crossvalidation is conducted on a threedimensional grid. By contrast, in each IBoostPermutation iteration, the tuning parameter α is fixed at 1, no crossvalidation is involved in the selection of λ, and LASSO is performed only once for each data type. Therefore, IBoostPermutation may serve as a computationally efficient alternative to IBoostCV.
Evaluation of LASSO, elastic net, and IBoost using TCGA data
We next evaluated the performance of the methods using three TCGA data sets, namely the lung adenocarcinoma (LUAD) data set, the kidney renal clear cell cancer (KIRC) data set, and a pancancer data set derived from ∼ 1400 patients that represents eight different tumor types considered by Hoadley et al. [26]; see the “Methods” section for a detailed description of the data sets and the evaluation procedure. For each data set, we first split the data 30 times into training and testing sets. We then performed LASSO, elastic net, and the two versions of IBoost for various combinations of data types on patients from the training set of each split. For each combination of data types and each split, we calculated the risk scores for patients in the testing set using the estimates from the corresponding training set, and we used the concordance index (Cindex) [27] to evaluate the prediction accuracy of the risk scores.
The average Cindex values over the splits obtained from LASSO and elastic net are given in Fig. 2. For the KIRC and pancancer data sets, the prediction tends to be much better than random (i.e., the Cindex values are much larger than 0.5). For the LUAD data set, which has a small sample size, some of the models yield relatively poor prediction (with Cindex values smaller than 0.6). For many models, the predictive performance of elastic net is either similar or superior to LASSO.
For LASSO and elastic net, the models containing more data types as predictors do not necessarily perform better than those with fewer data types. One possible explanation is that the extra data types may contain very little relevant information on patient survival, such that adding those data types introduces more noise than signal into the model. In practice, however, it is challenging to decide which data types to consider without prior knowledge of their importance.
Figure 3 shows the average values of the Cindex obtained from elastic net, IBoostCV, and IBoostPermutation for different models. For the LUAD, KIRC, and pancancer data sets, both versions of IBoost provide better prediction than elastic net in almost all cases. The difference in prediction accuracy between IBoost and elastic net is particularly large when the sample size is small and the number of predictors is large. The difference is likely due to the fact that IBoost involves the selection of data types, so that the large and nonpredictive data types would not be selected in most iterations, and their presence would not substantially worsen the prediction accuracy. For the KIRC and pancancer data sets, IBoostCV yields better prediction than IBoostPermutation, whereas for LUAD, there are no clear differences between the two methods.
Prognostic value of integrated clinical and genomics data
To assess whether the genomic variables provide extra predictive power in the presence of the clinical variables, we computed the net reclassification improvement (NRI) [28, 29] values between the models with both clinical and genomic variables (estimated by IBoostCV or IBoostPermutation) and the model with clinical variables only (estimated by maximum partial likelihood estimation). The NRI compares a model of interest with a baseline model and measures how much a subject’s predicted risk under the model of interest, relative to that under the baseline model, aligns with the subject’s survival time. For instance, an NRI of 0.2 means that by switching from the baseline model to the model of interest, the proportion of highrisk subjects being reassigned a larger predicted risk is on average larger, by a value of 0.2, than the proportion of lowrisk subjects being so reassigned; here, highrisk or lowrisk subjects refer, respectively, to those with survival times shorter or longer than a fixed threshold, which we set to be 3 years throughout the paper. (See the “Methods” section for a theoretical definition of the NRI.) The average NRI values over data splits are shown in Fig. 4.
The patterns of the results from IBoostCV and IBoostPermutation are similar. For the KIRC and pancancer data sets, the majority of the models that contain both clinical and genomic variables yield positive NRI, which implies that they provide better prediction than the model with clinical variables only. Most NRI values under IBoostCV are close to 0.2; in biomarker studies, an NRI of 0.2 is considered an intermediatelevel improvement [30]. For the LUAD data set, only a few models that contain both clinical and genomic variables provide better prediction than the model with clinical variables only. These results indicate that in certain cancer types, genomic variables contribute to survival prediction in the presence of clinical variables, and the magnitude of the contribution can be large. When the same comparisons are made using LASSO or elastic net, however, the inclusion of genomic variables in the models does not appreciably improve prediction.
Evaluation of gene expression modules
To compare the performance of gene modules versus individual gene expression data, we calculated the NRI values between models with each type of gene expression data separately. Specifically, for each combination of data types other than individual gene expression data and gene modules, we computed the NRI between the model with those data types and gene modules (estimated by IBoostCV or IBoostPermutation) and that with those data types and individual gene expression data. The NRI values are shown in Fig. 5. Under both methods, the use of gene modules leads to substantially better prediction than the use of expression data of all individual genes for the LUAD data set. For the KIRC and pancancer data sets, the performance of the two types of gene expression data is similar, and there is no strong evidence favoring gene modules or individual gene expression data on the basis of prediction accuracy. Nevertheless, because gene modules are smaller in number and much easier to interpret, we generally recommend the use of gene modules over individual gene expression data.
Comparison among genomics data types
To evaluate the relative prognostic value of each genomics data type, we formed a series of nested models as follows. We began by setting the model with clinical variables only as the first member of the series of models. At each later step, we computed the NRI between each model that contains all currently included data types and an extra genomics data type and the model included at the previous step. The model that yielded the largest NRI was set to be the next member of the series of models. The process was repeated until all data types were included. (Individual gene expression data were not considered in this analysis.) At each step of the process, the data type that yielded the largest improvement in predictive power (over the data types already included) was selected, so that more predictive data types tend to be included earlier, and the order in which the data types entered the models reflects their relative importance. We performed this procedure for elastic net and the two versions of IBoost. For the LUAD, KIRC, and pancancer data sets, the NRI values and their 95% confidence intervals for the series of models are plotted in Fig. 6, and the data type selected at each step is shown. We also plotted the Cindex against the number of variables selected for each model.
Because different methods vary in their abilities to extract useful information from given data types, the orders of data types determined by the methods are generally different. For the LUAD, KIRC, and pancancer data sets, the NRI under IBoostCV or IBoostPermutation tends to be positive or around zero with the inclusion of each new data type. This indicates that IBoost extracts useful information from each additional data type and that its performance tends not to be worsened by the inclusion of additional variables.
IBoostPermutation always selects the smallest number of variables, followed by elastic net and IBoostCV. This finding is consistent with the conclusions from the simulation studies. Because the Cindex obtained by IBoostPermutation is higher in most cases than that obtained by elastic net, we conclude that IBoostPermutation provides the same or better prediction using fewer variables than elastic net.
For the LUAD and pancancer data sets, gene modules are the first genomics data type selected under both versions of IBoost, and the inclusion of gene modules leads to considerable improvement in prediction accuracy. For the KIRC data set, miRNA expression data are first selected by IBoostCV, while gene modules are first selected by IBoostPermutation. For IBoostCV, however, the model with clinical variables and gene modules yields an NRI of 0.19, which represents a substantial improvement over the model with clinical variables only. The confidence intervals of the NRI include zero due to the small sample sizes of the testing data sets. Nevertheless, the pattern of consistent positive NRI values shown in Fig. 4 and the fact that the NRI values are averages over 30 data splits suggest that the improvement in prediction accuracy is robust. For both versions of IBoost, after the inclusion of the first genomics data type, the improvement in prediction accuracy with the inclusion of additional data types is marginal. We conclude that gene modules are overall the most predictive genomics data type, and the remaining genomics data types tend not to provide extra predictive power beyond clinical variables and gene modules.
We also evaluated the prognostic value of genomics data in the absence of clinical data. The average Cindex values for combinations of genomics data types over 30 training and testing data splits for the LUAD, KIRC, and pancancer data sets are given in Additional file 1: Fig. S1. The maximum Cindex values obtained using genomics data types alone are 0.64, 0.72, and 0.74 in the LUAD, KIRC, and pancancer data sets, respectively; they are substantially smaller than the corresponding maximum values obtained using both clinical and genomics data. For the LUAD data set, miRNA expression data alone yield the largest Cindex, whereas for the KIRC data set, the combination of miRNA expression and protein expression data yield the largest Cindex. For the pancancer data set, the Cindex values for combinations of genomics data types with individual gene expression data are almost identical and are larger than those obtained without individual gene expression data.
Important predictors for the LUAD, KIRC, and pancancer data sets
To obtain the final models of important predictors, we performed IBoostPermutation on the LUAD, KIRC, and pancancer data sets. The final models are shown in Tables 1, 2, and 3 for the LUAD, KIRC, and pancancer data sets, respectively. The predictors that are also selected by LASSO, elastic net, and IBoostCV are marked.
Age and pathological nodal status are negatively associated with survival time in the LUAD, KIRC, and pancancer data sets. Age has been reported to be prognostic for many cancer types [31–33]. In the analysis of the pancancer data set, cancer types were selected, which is logical, since the survival time is known to depend on cancer types [26]. Thus, the tissue of origin remains an important prognostic factor. Among the gene modules, Glycolysis_signature and MUnknown_24 are negatively associated with survival time in the LUAD and pancancer data sets; these two modules are correlated with Hypoxia signatures among a set of 1198 TCGA breast cancer patients. Likewise, Pcorr_IGS_Correlation and Activate.Endothelium, which are negatively associated with survival time in the pancancer data set, are correlated with proliferation signatures; the latter are known to be negatively associated with survival time.
In contrast, signatures of CD8 T cells, noninflammatory breast cancer (nIBC and MM_Red2), and luminal features (Mature_LuminalUp, GP7_estrogen signaling, HS_Green1, HS_Green8, LUMINAL_Cluster, Duke_Module06_er, Pcorr_Dasatinib_L_Correlation, and HS_Green18) are positively associated with survival time in the KIRC or pancancer data sets. The NEU_cluster module is positively associated with survival time in the LUAD data set, which is biologically significant because this module represents epithelial luminal cell differentiation and thus tracks more differentiated and lower grade lung cancers. The selected features, many of which are also selected by other variable selection methods, have significant biological implications and demonstrate the robustness of the IBoost methodology.
Conclusions
In this paper, we present a novel method, termed IBoost, for variable selection and outcome prediction that is especially powerful when one wishes to simultaneously consider multiple genomics and/or proteomics data types. We used simulation studies and real data to demonstrate that in the presence of multiple data types with diverse signal strength, IBoost produces better outcome prediction than LASSO and elastic net. We proposed two versions of IBoost, namely IBoostCV and IBoostPermutation. IBoostCV yields more accurate prediction than IBoostPermutation, but it generally selects many more variables and is computationally more intensive. By contrast, IBoostPermutation is computationally efficient and selects much fewer variables, which may be preferable for followup experiments.
Consistent with the current literature, we found that clinical variables are strong predictors of survival time. With IBoost, we were able to build upon the clinical variables and extract additional useful information from genomic variables in order to improve the prediction; the improvement that we obtained with IBoost was considerably larger than that obtained by either LASSO or elastic net. We also compared the use of individual gene expression data versus gene modules and found that the use of gene modules leads to improvement in prediction accuracy and more interpretable results. When we considered the selected IBoost models, clinical variables (e.g., age, tumor size, and pathological nodal status) were strong predictors of survival. The IBoost methods also selected several gene modules that were previously identified as prognostic of outcomes, whether positive or negative.
Our study has limitations. The main limitation is that the LUAD and KIRC data sets pertain to a relatively small number of patients, with an even smaller number of observed events. This limitation motivated us to combine eight solid epithelial tumor types to form a large pancancer data set. The analyses on the pancancer data might not properly account for heterogeneity across different cancer types. Another limitation of our study is that the quality of the clinical data varies across different cancer types; for example, the followup time for some cancer types was quite short.
In summary, we demonstrated that the performance of IBoost is superior to that of elastic net and LASSO and that the performance of gene modules is superior to that of the totality of individual genes. The IBoost methodology is applicable to any disease states where multiple types of genomics and/or proteomics data are available and thus has potential applications beyond cancer studies.
Methods
Data description
TCGA provides a large openaccess database that includes clinical and genomics data for patients with 33 cancer types or subtypes. Herein, we focused on eight cancer types or subtypes, namely, LUAD, KIRC, colon adenocarcinoma (COAD), rectal adenocarcinoma (READ), lung squamous cell carcinoma (LUSC), bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), and head and neck squamous cell carcinoma (HNSC). For clinical variables, somatic mutation, copy number variation, mRNA expression, and miRNA expression, data on 2272 patients were obtained from the December 22, 2012, PanCancer12 data freeze from the Sage Bionetworks Repository Synapse [34]; the data were previously processed and described by Hoadley et al. [26]. Protein expression data were downloaded from Broad GDAC Firehose [35] for a subset of 1779 patients included in the data set of Hoadley et al. [26].
Clinical variables included gender, age, pathological stages T and N, and cancer type. In all analyses, COAD and READ were considered as one cancer type. For mRNA expression data, we used RNAseq by ExpectationMaximization (RSEM) [36] to quantify the transcript abundances measured by RNA sequencing and used the log2transformed upquantilenormalized RSEM values of 12,434 genes. The RNA sequencing was performed at the University of North Carolina at Chapel Hill [37–39]. Gene level expression data are also available on the Broad GDAC Firehose [35]. For mutation data, we used the single nucleotide variant calls, which were deduplicated and reannotated using the Ensembl version 69 transcript database. A total of 130 genes with nonsynonymous mutations in more than 10% of the whole sample were included for the analyses. The combined mutation annotation format file is available from the Synapse resource. For miRNA expression data, we used the read count data for 305 normalized expressions, which were compiled into an abundance matrix for 5p and 3p mature miRBase strands [37]. For reversephase protein arrays, we used the level 3 normalized data for 136 proteins or phosphoproteins. For copy number data, SNP6.0 arraybased genelevel somatic copy number alteration data were generated from the GISTIC analysis [40]. The input data matrix is available in Synapse at syn1710678. We used the copy number values for 216 cancerspecific segments, which are frequently altered in cancer of various types including breast cancer, and segments for all chromosome arms (a total of 41 segments) [41, 42].
We defined gene modules as sets of coexpressed genes that are considered to be functional units in breast cancer. We built a collection of 497 gene modules. The modules were constructed on the basis of 73 publications or results from the Gene Set Enrichment Analysis [43]. A partial list of the modules appears in Fan et al. [12]. Among the modules, 461 are median expression values for homogeneously expressed genes, 33 are correlations of expression values with predetermined gene centroids, and 3 are built from previously published gene expression prognostic models.
After removing patients with missing data, the total sample size was 1420, including 202 LUAD patients and 195 KIRC patients. All survival times were censored at 5 years if the patients were still in the study at that time point. For the pancancer data set, the median followup time was 16.8 months, and the censoring rate was 77.6%. For the subset of LUAD patients, the median followup time was 13.9 months, and the censoring rate was 71.3%. For the subset of KIRC patients, the median followup time was 28.9 months, and the censoring rate was 63.6%.
LASSO and elastic net
We implemented LASSO and elastic net using the Rpackage “glmnet” [44] and used fivefold crossvalidation to select the tuning parameters. For elastic net, crossvalidation was performed over a twodimensional grid of (α,λ), while for LASSO, α was set to be 1. For elastic net, the grid for α was chosen to be (0.05,0.1,0.2,…,1.0), and a grid for λ was chosen separately for each α using the default settings of glmnet. (A minimum value of 0.05 was considered for α, because α too close to 0 may result in too many variables being selected; in particular, no sparsity is imposed if α=0.) To make the selection procedure more stable, we repeated the split and evaluation procedure five times, and the crossvalidation errors were averaged over the five repetitions.
IBoost
The IBoost algorithm is given as follows:

1
Set f_{0,i}=0 for i=1,…,n, and let f_{0}=(f_{0,1},…,f_{0,n})^{′}.

2
Consider m=1,2,…:

(a)
For a given k_{m}∈{1,…,K}, calculate
$$\begin{aligned} \boldsymbol{\beta}^{(m)} &=\text{argmax}_{\boldsymbol{\beta}}\left\{\log L^{(k_{m})}(\boldsymbol{f}_{m1};\boldsymbol{\beta}) \right.\\ & \quad\left. p^{(k_{m})}(\boldsymbol{\beta};\alpha_{m},\lambda_{m})\right\} \end{aligned} $$using the coordinatedescent algorithm [44], where
$$L^{(k)}(\boldsymbol{f};\boldsymbol{\beta})\equiv\prod_{i=1}^{n}\left(\frac{e^{\ f_{i}+\boldsymbol{X}^{(k)\prime}_{i}\boldsymbol{\beta}}} {\sum_{j:Y_{j}\ge Y_{i}}e^{\ f_{j}+\boldsymbol{X}^{(k)\prime}_{j}\boldsymbol{\beta}}}\right)^{\Delta_{i}} $$is the partial likelihood with offset term f and covariates X^{(k)}, α_{m} and λ_{m} are tuning parameters, f=(f_{1},…,f_{n})^{′}, and
$$p^{(k)}(\boldsymbol{\beta};\alpha,\lambda)\equiv\lambda\left\{\alpha\sum_{j=1}^{d_{k}}\beta_{j}+\frac{1}{2}(1\alpha)\sum_{j=1}^{d_{k}}\beta_{j}^{2}\right\} $$is the elastic net penalty. The selection of k_{m}, α_{m}, and λ_{m} is described below.

(b)
Set \(f_{m,i}=f_{m1,i}+v\boldsymbol {X}^{(k_{m})\prime }_{i}\boldsymbol {\beta }^{(m)}\) for i=1,…,n with v=0.1 and f_{m}=(f_{m,1},…,f_{m,n})^{′}.

(a)
At the mth iteration, only the regression parameters corresponding to the k_{m}th data type are updated. We refer to the dvector with value β^{(m)} at the positions corresponding to the k_{m}th data type and zero elsewhere as the current estimate at the mth iteration. The current estimate at each iteration contributes to the final parameter estimate additively, and the final parameter estimate is simply the sum of the current estimates obtained from all steps multiplied by v.
IBoostCV and IBoostPermutation use crossvalidation and permutation, respectively, to choose (k_{m},α_{m},λ_{m}) at step 2(a). For IBoostCV, we adopt fivefold crossvalidation separately at each iteration over a threedimensional grid on \(\{1,\ldots,K\}\times [0.05,1]\times (0,\lambda ^{(\max)}_{m})\) for (k_{m},α_{m},λ_{m}), where \(\lambda ^{(\max)}_{m}\) is a value large enough to shrink the current estimate to zero.
For IBoostPermutation, we first perform LASSO separately for each data type X^{(k)} (k=1,…,K) with tuning parameter \(\lambda ^{(k)}_{m}\), where \(\lambda ^{(k)}_{m}\) is selected using the permutation method proposed by Sabourin et al. [25]; the permutation method is only applicable to LASSO. The procedure is motivated by the principle that in a null model, i.e., in the absence of any relevant predictors, the tuning parameters should be chosen such that no variable is selected. The permutation selection procedure first generates hypothetical null models by randomly permuting (Y_{i},Δ_{i},f_{m−1,i})B times at each iteration, so that in each permuted data set, the association between the predictors and the outcome (and the offset term) is removed. The procedure then finds the smallest λ such that no variable is selected for each permuted data set and selects the median of the B values of λ. For the kth data type (k=1,…,K), let \(\lambda ^{(k)}_{m}\) be the selected tuning parameter and \(\boldsymbol {\beta }^{(k)}_{m}\) be the corresponding LASSO estimate. We select k_{m} based on the partiallikelihood value at \(\boldsymbol {\beta }^{(k)}_{m}\), i.e., \(k_{m} = \text {argmax}_{k} L^{(k)}(f_{m1};\boldsymbol {\beta }^{(k)}_{m})\), and set α_{m}=1 and λ_{m}=λm(k_{m}).
Empirical studies suggested that a small value of the step length factor v often improves and almost never worsens the performance of boosting [45]. Therefore, it is recommended that v is chosen to be as small as possible while the algorithm remains computationally feasible. In the settings we have considered, the performance of IBoost is not sensitive to v within the range of v∈(0.05,0.5). Therefore, we set v to a moderately small value of 0.1.
Conventional boosting methods require a stopping criterion to avoid overfitting. In our experience, however, because the tuning parameters are selected separately at each iteration for IBoost, they eventually lead to shrinkage of all (current) parameter estimates. Therefore, we do not adopt a separate procedure to determine the stopping time of the iteration. We terminate the iteration when f_{m} remains constant for five consecutive iterations.
Simulation studies
In the simulation studies, we considered all data types except individual gene expression data. For each simulation data set, we generated the predictors by sampling without replacement whole vectors of predictors from the TCGA pancancer data set. We generated the survival time from a proportional hazards model with the baseline hazard function h_{0}(t)=t and generated the censoring time from an exponential distribution with a mean chosen to result in censoring proportion of about 50%. We set the sample size n to 500 in all settings.
The regression parameters were chosen to produce a different proportion of signals across data types, where the signal of data type k is defined to be \(\text {Var}(\boldsymbol {X}^{(k)\prime } \boldsymbol {\beta }^{(k)}_{0})\), and the predictors were standardized. The variables with nonzero regression parameters, hereafter referred to as signal variables, were chosen to be weakly correlated. We considered three settings, with the distributions of signals and number of signal variables shown at the bottom of Fig. 1. In all settings, the signals of all data types sum to 1.2, and the regression parameters of signal variables of the same data type are equal; based on simulation studies not presented, the relative performance of different methods is very similar under different values of total signal. In Setting 1, the clinical variables contain much stronger signals than the other data types. Mutation and copy number variation data do not contain any signal. In Setting 2, all signals are concentrated on the clinical variables and gene modules, and the two data types equally share the signals. In Setting 3, the clinical variables contain the most signals, and the remaining signals are evenly distributed across the other data types.
Because we considered a total of six data types, IBoostCV is computationally demanding. To lessen the computational burden, we set v=0.2 instead of the value 0.1 used in real data analysis.
Assessment of prediction
To assess an analysis method, we split the data into 30 training and testing sets with a 3 :2 ratio of sample sizes. We used the Rpackage “sampling” [46] to perform the data split, such that the distributions of the clinical variables in the training and testing sets are approximately equal. We performed the analysis on the training sets, and the results were assessed on the corresponding testing sets using the Cindex. For each split of the data, we repeated this estimationvalidation procedure on different combinations of data types as predictors. We only consider combinations of data types that include clinical variables, because clinical variables are almost always considered in practice, and one of the main objectives of this paper is to evaluate the prognostic value of the combination of genomics and clinical data. The analyses were conducted on the 30 splits of the data and on the 48 combinations of data types for the LUAD, KIRC, and pancancer data sets.
To quantify the prediction accuracy, we used the Cindex. Let T_{i} be the survival time and X_{i} be a vector of predictors for the ith subject, and let β be a vector of regression parameters. The risk score is defined as \(\boldsymbol {X}_{i}^{\prime } \boldsymbol {\beta }\). If T_{i} and \(\boldsymbol {X}_{i}^{\prime } \boldsymbol {\beta }\) are continuous, then the Cindex is defined as \(P(\boldsymbol {X}_{i}^{\prime } \boldsymbol {\beta } > \boldsymbol {X}_{j}^{\prime } \boldsymbol {\beta }\mid T_{i}<T_{j})\). The Cindex is the probability that for a random pair of subjects in which the first subject has a shorter survival time, the risk score for the first subject is higher. Thus, Cindex measures how well the risk score aligns with the actual survival time. For each pair of training and testing sets, we set β to be the parameter estimate obtained from the training set and estimated the Cindex for the testing set using the method of Pencina and D’Agostino [27]. If no variable was selected, then a Cindex value of 0.5 was assigned.
We used the NRI to compare the prediction accuracy of a model of interest and a baseline model. Let T be the survival time, X and \(\tilde {\boldsymbol {X}}\) be vectors of predictors for the model of interest and the baseline model, respectively, and β and \(\tilde {\boldsymbol {\beta }}\) be the corresponding vectors of regression parameters. Let q and \(\tilde {q}\) be the (estimated) survival probabilities at a fixed time point t_{0} given univariate covariates X^{′}β and \(\tilde {\boldsymbol {X}}^{\prime }\tilde {\boldsymbol {\beta }}\), respectively, where t_{0} is a survivaltime threshold, such that subjects with T<t_{0} are considered high risk. The NRI is defined as \(P\left (q < \tilde {q}\mid T < t_{0}\right)  P\left (q < \tilde {q}\mid T >t_{0}\right)\). A large NRI means that by switching from the baseline model to the model of interest, the direction of change of the predicted risk aligns with the actual survival time for a large proportion of subjects. To compute the NRI between two models using a pair of training and testing sets, we set \((\boldsymbol {\beta },\tilde {\boldsymbol {\beta }})\) to be the parameter estimates obtained from the training set and calculated \((q,\tilde {q})\) on the testing set. We estimated the NRI and its confidence interval on the testing set using the method of Uno et al. [29]. The reported NRI values and confidence limits are the average values over 30 training and testing data splits. Note that this NRI is one half the value of the NRI(>0) defined in Pencina et al. [28, 30].
References
 1
Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, et al.Gene expressionbased survival prediction in lung adenocarcinoma: a multisite, blinded validation study. Nat Med. 2008; 14:822–7.
 2
West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA. 2001; 98:11462–7.
 3
Beer DG, Kardia SL, Huang CC, Giordano TJ, Levin AM, Misek DE, et al. Geneexpression profiles predict survival of patients with lung adenocarcinoma. Nat Med. 2002; 8:816–24.
 4
Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, et al. Diffuse large Bcell lymphoma outcome prediction by geneexpression profiling and supervised machine learning. Nat Med. 2002; 8:68–74.
 5
van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002; 415:530–6.
 6
Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009; 25:2906–12.
 7
Mo Q, Wang S, Seshan VE, Olshen AB, Schultz N, Sander C, et al. Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc Natl Acad Sci USA. 2013; 110:4245–50.
 8
Lock EF, Hoadley KA, Marron JS, Nobel AB. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat. 2013; 7:523–42.
 9
Wang W, Baladandayuthapani V, Morris JS, Broom BM, Manyam G, Do KA. iBAG: integrative Bayesian analysis of highdimensional multiplatform genomics data. Bioinformatics. 2013; 29:149–59.
 10
Yuan Y, Van Allen EM, Omberg L, Wagle N, AminMansour A, Sokolov A, et al. Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nat Biotechnol. 2014; 32:644–52.
 11
Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009; 27:1160–7.
 12
Fan C, Prat A, Parker JS, Liu Y, Carey LA, Troester MA, et al. Building prognostic models for breast cancer patients using clinical variables and hundreds of gene expression signatures. BMC Med Genomics. 2011; 4:3.
 13
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996; 58:267–88.
 14
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol. 2005; 67:301–20.
 15
Schapire RE. The strength of weak learnability. Mach Learn. 1990; 5:197–227.
 16
Freund Y, Schapire RE. A decisiontheoretic generalization of online learning and an application to boosting. J Comput Syst Sci. 1997; 55:119–39.
 17
Breiman L. Arcing classifier (with discussion). Ann Stat. 1998; 26:801–49.
 18
Friedman J, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting (with discussion). Ann Stat. 2000; 28:337–407.
 19
Hothorn T, Bühlmann P, Dudoit S, Molinaro A, Van Der Laan MJ. Survival ensembles. Biostatistics. 2005; 7:355–73.
 20
Mayr A, Binder H, Gefeller O, Schmid M. The evolution of boosting algorithms. Methods Inf Med. 2014; 53:419–27.
 21
Mayr A, Binder H, Gefeller O, Schmid M. Extending statistical boosting. Methods Inf Med. 2014; 53:428–35.
 22
Bühlmann P, Yu B. Boosting with the L2 loss: regression and classification. J Am Stat Assoc. 2003; 98:324–39.
 23
Cox DR. Regression models and lifetables. J R Stat Soc Series B Stat Methodol. 1972; 34:187–220.
 24
Cox DR. Partial likelihood. Biometrika. 1975; 62:269–76.
 25
Sabourin JA, Valdar W, Nobel AB. A permutation approach for selecting the penalty parameter in penalized model selection. Biometrics. 2015; 71:1185–94.
 26
Hoadley KA, Yau C, Wolf DM, Cherniack AD, Tamborero D, Ng S, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 2014; 158:929–44.
 27
Pencina MJ, D’Agostino RB. Overall C as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Stat Med. 2004; 23:2109–23.
 28
Pencina MJ, D’Agostino Sr RB, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011; 30:11–21.
 29
Uno H, Tian L, Cai T, Kohane IS, Wei L. A unified inference procedure for a class of measures to assess improvement in risk prediction systems with survival data. Stat Med. 2013; 32:2430–42.
 30
Pencina MJ, D’Agostino RB, Pencina KM, Janssens ACJ, Greenland P. Interpreting incremental value of markers added to risk prediction models. Am J Epidemiol. 2012; 176:473–81.
 31
Lieu CH, Renfro LA, De Gramont A, Meyers JP, Maughan TS, Seymour MT, et al. Association of age with survival in patients with metastatic colorectal cancer: analysis from the ARCAD Clinical Trials Program. J Clin Oncol. 2014; 32:2975–82.
 32
de la Rochefordière A, Campana F, Fenton J, Vilcoq J, Fourquet A, Asselain B, et al. Age as prognostic factor in premenopausal breast carcinoma. Lancet. 1993; 341:1039–43.
 33
Asmis TR, Ding K, Seymour L, Shepherd FA, Leighl NB, Winton TL, et al. Age and comorbidity as independent prognostic factors in the treatment of non–smallcell lung cancer: a review of National Cancer Institute of Canada Clinical Trials Group trials. J Clin Oncol. 2008; 26:54–9.
 34
Sage Bionetworks Repository Synapse. Multiplatform analysis of 12 cancer types to identify integrative subtypes; https://www.synapse.org/#!Synapse:syn2468297/. Accessed 12 Oct 2015.
 35
Broad Institute TCGA Genome Data Analysis Center. Analysisready standardized TCGA data from Broad GDAC Firehose 2016_01_28 run. Broad Institute of MIT and Harvard; https://doi.org/10.7908/C11G0KM9. Accessed 26 Jun 2017.
 36
Li B, Dewey CN. RSEM: accurate transcript quantification from RNASeq data with or without a reference genome. BMC Bioinformatics. 2011; 12:323.
 37
The Cancer Genome Atlas Research Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012; 490:61–70.
 38
The Cancer Genome Atlas Research Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature. 2012; 487:330–7.
 39
The Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature. 2012; 489:519–25.
 40
Zack TI, Schumacher SE, Carter SL, Cherniack AD, Saksena G, Tabak B, et al. Pancancer patterns of somatic copy number alteration. Nat Genet. 2013; 45:1134–40.
 41
Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, et al. The landscape of somatic copynumber alteration across human cancers. Nature. 2010; 463:899–905.
 42
Chao HH, He X, Parker JS, Zhao W, Perou CM. Microscale genomic DNA copy number aberrations as another means of mutagenesis in breast cancer. PLoS ONE. 2012; 7:e51719.
 43
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proc Natl Acad Sci USA. 2005; 102:15545–50.
 44
Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011; 39:1–13.
 45
Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001; 29:1189–232.
 46
Tillé Y, Matei A. Sampling: survey sampling. R package version 2.8. 2016.
 47
Wong KY, Fan C, Maki T, Parker JS, Nobel AB, Zeng D, et al. IBoost: an integrative boosting approach for predicting survival time with multiple genomics platforms. Processed data; 2019. https://doi.org/10.5281/zenodo.2530387. Accessed 4 Jan 2019.
 48
Wong KY, Fan C, Maki T, Parker JS, Nobel AB, Zeng D, et al. IBoost: an integrative boosting approach for predicting survival time with multiple genomics platforms. Source code Github repository; 2019. https://github.com/alexwky/IBoost. Accessed 4 Jan 2019.
 49
Wong KY, Fan C, Maki T, Parker JS, Nobel AB, Zeng D, et al. IBoost: an integrative boosting approach for predicting survival time with multiple genomics platforms. Source code; 2019. https://doi.org/10.5281/zenodo.2529986. Accessed 4 Jan 2019.
 50
Wong KY, Fan C, Maki T, Parker JS, Nobel AB, Zeng D, et al. IBoost: an integrative boosting approach for predicting survival time with multiple genomics platforms. Code Github repository; 2019. https://github.com/alexwky/IBoostPaper2019. Accessed 7 Jan 2019.
 51
Wong KY, Fan C, Maki T, Parker JS, Nobel AB, Zeng D, et al. IBoost: an integrative boosting approach for predicting survival time with multiple genomics platforms. Code; 2019. https://doi.org/10.5281/zenodo.2532847. Accessed 7 Jan 2019.
Acknowledgements
Not applicable.
Funding
This work was supported with funds from the NCI Breast SPORE program (P50CA5822309A1), the Breast Cancer Research Foundation, the Susan G. Komen, The V Foundation for Cancer Research, and by National Institutes of Health grants R01CA148761, R01GM047845, R01HG009974, and P01CA142538.
Availability of data and materials
For clinical variables, somatic mutation, copy number variation, mRNA expression, and miRNA expression, data were obtained from Synapse [34] (https://www.synapse.org/#!Synapse:syn2468297). Protein expression data were downloaded from Broad GDAC Firehose [35] (https://doi.org/10.7908/C11G0KM9). The processed data set used in this paper is deposited on Zenodo [47] (https://doi.org/10.5281/zenodo.2530387).
The IBoost software is distributed under the MIT license and can be downloaded from Github [48] (https://github.com/alexwky/IBoost) or Zenodo [49] (https://doi.org/10.5281/zenodo.2529986). The simulated data sets and the codes to reproduce all the analyses presented in this paper are available on GitHub [50] (https://github.com/alexwky/IBoostPaper2019) and deposited on Zenodo [51] (https://doi.org/10.5281/zenodo.2532847).
Author information
Affiliations
Contributions
DYL and CMP coordinated the overall studies. KYW performed the statistical analyses. KYW, ABN, DZ, and DYL developed the statistical methodologies. CF and JSP coordinated the processing of the data. KYW, MT, DYL, and CMP wrote the paper, which all authors reviewed. All authors read and approved the final manuscript.
Corresponding authors
Correspondence to DanYu Lin or Charles M. Perou.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
CMP is an equity stock holder, consultant, and Board of Director Member of BioClassifier LLC and GeneCentric Diagnostics. CMP is also listed as an inventor on patents on the Breast PAM50 and Lung Cancer Subtyping assays. The other authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Fig. S1. The prognostic value of genomics data types. (PDF 38 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Cancer genomics
 Data integration
 Gene modules
 Variable selection