- Open Access
Revisit linear regression-based deconvolution methods for tumor gene expression data
© The Author(s). 2017
- Received: 26 May 2017
- Accepted: 12 June 2017
- Published: 5 July 2017
We have recently published a statistical deconvolution method to study infiltrating immune cells using tumor RNA-seq data. One of the goals in that work was to understand how the proportions of different cell types co-vary across different cancer tissues. To this end, we estimated the abundance of six cell types in over 9000 tumor samples across 23 cancer types, and then assessed the correlations of these estimated proportions across the samples within each cancer type. In particular, we compared our method (TIMER) with CIBERSORT, a previously published deconvolution approach, for their ability to assess such correlations. To our surprise, we found many non-biological negative correlations between CIBERSORT estimates, and we believed that this artifact was, to a large extent, due to the incorporation of highly similar features in the linear model, that is, statistical collinearity. Newman et al., the authors of CIBERSORT, have raised concerns that these correlations were due to data normalization rather than collinearity. While we agree with Newman and coauthors that the forced normalization indeed introduces unwanted negative correlations, we will show in this response that the inclusion of highly similar features contributes at least as much as normalization to the observed artificial negative correlations among the estimates obtained by CIBERSORT.
Highly correlated features (covariates) in linear regression models can lead to many technical difficulties, such as high estimation variance, non-robustness, and non-identifiability. Furthermore, it is often misleading to interpret their coefficients at face value. For example, it is easy to construct cases in which, when only one of two highly similar features is included in a regression model, its coefficient is highly significant and positive, whereas when both are included, either neither coefficient is significant or one is significantly positive and the other is negative. This issue is a fundamental statistical problem caused by a lack of information and is unlikely to be solved simply by the regularization employed by the CIBERSORT method.
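The behavior described above can be reproduced with a small simulation. The following is a minimal sketch, not the analysis used in this work: it assumes ordinary least squares without an intercept, two hypothetical features `x1` and `x2` where `x2` is a noisy copy of `x1`, and arbitrary illustrative values for the sample size, noise level, and true coefficients (0.5 each). It compares the coefficient of `x1` when it is fitted alone with the coefficients obtained when both features are included.

```python
import random
import statistics


def fit_one(x1, y):
    """No-intercept least squares for y ~ b * x1."""
    return sum(a * c for a, c in zip(x1, y)) / sum(a * a for a in x1)


def fit_two(x1, x2, y):
    """No-intercept least squares for y ~ b1 * x1 + b2 * x2 (2x2 normal equations)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det


def simulate(n_obs=50, n_rep=500, seed=0):
    """Refit both models on n_rep noisy responses generated from one fixed design."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n_obs)]
    # Hypothetical near-duplicate feature: x1 plus a small perturbation.
    x2 = [a + 0.1 * rng.gauss(0, 1) for a in x1]
    single, joint = [], []
    for _ in range(n_rep):
        # Assumed true model: both features contribute equally.
        y = [0.5 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
        single.append(fit_one(x1, y))
        joint.append(fit_two(x1, x2, y))
    return single, joint


single, joint = simulate()
sd_single = statistics.stdev(single)
sd_joint = statistics.stdev([b1 for b1, _ in joint])
# Share of replicates in which at least one of the two coefficients is negative.
frac_negative = sum(1 for b1, b2 in joint if b1 < 0 or b2 < 0) / len(joint)

print("sd of x1 coefficient, x1 alone:", round(sd_single, 3))
print("sd of x1 coefficient, x1 and x2:", round(sd_joint, 3))
print("replicates with a negative coefficient:", frac_negative)
```

Under these assumptions the single-feature coefficient stays positive and stable across replicates, while the two-feature fit has a much larger spread and often assigns a negative coefficient to one of the features, even though both true coefficients are positive.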
In the second simulation, we replaced CD8 T cells and neutrophils with two highly correlated features: naïve and memory B cells. According to the LM22 matrix, the expression profiles of these two cell types are highly correlated (r = 0.9). We performed the same simulation and used the online CIBERSORT server to infer f1 and f2. The estimated fractions of the two cell types had an even stronger negative correlation (r = −0.7; Fig. 1c, d). These results indicate that, in addition to data normalization, the incorporation of highly correlated features exaggerates the non-biological negative correlations between the estimated coefficients.
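The effect of signature similarity alone can be illustrated with a toy version of such a simulation. This is a hedged sketch under stated assumptions, not a reproduction of the analysis above: ordinary least squares stands in for CIBERSORT, the two signatures are hypothetical standardized (zero-mean) profiles with a chosen correlation `rho`, the true fractions `f1` and `f2` are drawn independently, and the gene count, sample count, and noise level are arbitrary. No normalization step is applied, so any negative correlation between the estimates comes from collinearity.

```python
import random


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def estimated_fraction_correlation(rho, n_genes=100, n_samples=1000,
                                   noise_sd=1.0, seed=1):
    """Correlation between least-squares estimates of two independent fractions."""
    rng = random.Random(seed)
    # Hypothetical signatures; s2 is built to correlate with s1 at about rho.
    s1 = [rng.gauss(0, 1) for _ in range(n_genes)]
    s2 = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in s1]
    s11 = sum(a * a for a in s1)
    s22 = sum(b * b for b in s2)
    s12 = sum(a * b for a, b in zip(s1, s2))
    det = s11 * s22 - s12 * s12
    est1, est2 = [], []
    for _ in range(n_samples):
        # True fractions are independent, so their true correlation is zero.
        f1, f2 = rng.uniform(0, 0.5), rng.uniform(0, 0.5)
        mix = [f1 * a + f2 * b + rng.gauss(0, noise_sd) for a, b in zip(s1, s2)]
        t1 = sum(a * m for a, m in zip(s1, mix))
        t2 = sum(b * m for b, m in zip(s2, mix))
        # Solve the 2x2 normal equations for the two fractions.
        est1.append((s22 * t1 - s12 * t2) / det)
        est2.append((s11 * t2 - s12 * t1) / det)
    return pearson(est1, est2)


corr_low = estimated_fraction_correlation(rho=0.1)
corr_high = estimated_fraction_correlation(rho=0.9)
print("weakly correlated signatures:", round(corr_low, 2))
print("highly correlated signatures:", round(corr_high, 2))
```

With these illustrative settings, the estimates for weakly correlated signatures remain close to uncorrelated, whereas the estimates for highly correlated signatures become clearly negatively correlated although the true fractions are independent.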
Associations of CIBERSORT estimates for closely related features
| Cancer type | Naive versus memory B cell (expected r = 0.7) | Activated versus resting CD4 memory T cell (expected r = 0.3) |
| | r = −0.34, ρ = −0.77 | r = −0.05, ρ = −0.06 |
| Kidney renal clear cell carcinoma | r = −0.07, ρ = −0.29 | r = −0.12, ρ = −0.13 |
| Lung squamous carcinoma | r = 0.13, ρ = −0.38 | r = −0.26, ρ = −0.19 |
| | r = −0.07, ρ = −0.37 | r = −0.29, ρ = −0.26 |
The above analyses indicate that incorporating cells with similar features causes CIBERSORT to produce results with non-biological associations.
where i indicates the tumor sample, j indexes the immune cell type, and g stands for an LM22 gene, with constraints $\sum_{j=1}^{22} f_{ji} = 1$ and $f_{ji} \ge 0$ for all i, j. An important hidden assumption of this model is that malignant cells in the tumor tissue do not express a significant amount of any of the LM22 genes, which have been selected a priori based on mRNA expression data profiled from sorted immune cells. However, due to genome instability, it is possible that malignant cells also express immune-related genes.
To examine whether this is true, we analyzed the LM22 signature genes using The Cancer Genome Atlas data. We found that, in multiple cancers, a substantial fraction of the 513 LM22 signature genes showed positive correlations with tumor purity (Fig. 2). Such a correlation suggests that samples with higher tumor content express these genes at higher levels, indicating that these genes are also expressed in malignant cells. In their analysis, Newman et al. mixed colon cancer cell line expression data with immune subsets in silico to show that CIBERSORT works for tumor tissues. In our analysis, we found that colon cancer expresses a very small number of LM22 genes (n = 56), which explains why CIBERSORT may work well for this cancer type. However, given that up to a quarter of the LM22 genes are not immune-specific in many other cancer types, the model assumption of CIBERSORT is frequently violated. As a consequence, it is likely that the CIBERSORT inferences derived from these genes are confounded by cancer cell expression. This is a possible reason why CIBERSORT failed to identify putative prognostic factors in these cancers, such as T cells in melanoma and ovarian cancer, or macrophages in glioma.
Finally, we would like to re-emphasize that CIBERSORT and TIMER target different aspects of tumor immune infiltrates. CIBERSORT infers the relative fractions of immune subsets in the total leukocyte population, while TIMER predicts the abundance of immune cells in the overall tumor microenvironment. Currently both methods are limited by the assumption that transcriptomes of tumor-infiltrating immune cells do not significantly differ from those collected from peripheral blood of healthy donors. This is a convenient assumption based on practicality but may not hold for many tumors. Future deconvolution methods could continue to improve, with more studies profiling the tumor-infiltrating immune subsets or single-cell tumor transcriptomes to generate high-quality reference data.
This work is supported by NCI grant 1U01 CA180980.
Availability of data and materials
BL performed the analysis and wrote the responses. JSL and XSL supervised this study. All authors read and approved the final manuscript.
BL is a postdoctoral research fellow. Both JSL and XSL are tenured professors at Harvard University.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Li B, Severson E, Pignon JC, Zhao H, Li T, Novak J, et al. Comprehensive analyses of tumor immunity: implications for cancer immunotherapy. Genome Biol. 2016;17(1):174. doi:10.1186/s13059-016-1028-7.
- Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, et al. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods. 2015;12(5):453–7. doi:10.1038/nmeth.3337.
- Newman AM, Gentles AJ, Liu CL, Diehn M, Alizadeh AA. Data normalization considerations for digital tumor dissection. Genome Biol. 2017. doi:10.1186/s13059-017-1257-4.
- Nagorsen D, Voigt S, Berg E, Stein H, Thiel E, Loddenkemper C. Tumor-infiltrating macrophages and dendritic cells in human colorectal cancer: relation to local regulatory T cells, systemic T-cell response against tumor-associated antigens and survival. J Transl Med. 2007;5:62. doi:10.1186/1479-5876-5-62.
- Mabbott NA, Baillie JK, Brown H, Freeman TC, Hume DA. An expression atlas of human primary cells: inference of gene function from coexpression networks. BMC Genomics. 2013;14:632. doi:10.1186/1471-2164-14-632.
- Gentles AJ, Newman AM, Liu CL, Bratman SV, Feng W, Kim D, et al. The prognostic landscape of genes and infiltrating immune cells across human cancers. Nat Med. 2015;21(8):938–45. doi:10.1038/nm.3909.