xCell: digitally portraying the tissue cellular heterogeneity landscape

Tissues are complex milieus consisting of numerous cell types. Several recent methods have attempted to enumerate cell subsets from transcriptomes. However, the available methods have used limited sources for training and give only a partial portrayal of the full cellular landscape. Here we present xCell, a novel gene signature-based method, and use it to infer 64 immune and stromal cell types. We harmonized 1822 pure human cell type transcriptomes from various sources and employed a curve fitting approach for linear comparison of cell types and introduced a novel spillover compensation technique for separating them. Using extensive in silico analyses and comparison to cytometry immunophenotyping, we show that xCell outperforms other methods. xCell is available at http://xCell.ucsf.edu/. Electronic supplementary material The online version of this article (doi:10.1186/s13059-017-1349-1) contains supplementary material, which is available to authorized users.


Introduction
In addition to malignant proliferating cells, tumors are also composed of numerous distinct non-cancerous cell types and activation states of those cell types. This notion, which is termed the tumor microenvironment, has been in the spotlight of research in recent years and is being further explored by novel techniques. The most studied set of non-cancerous cell types are the tumor-infiltrating lymphocytes (TILs). However, these TILs are only part of a variety of innate and adaptive immune cells, stroma cells and many other cell types that are found in the tumor and interact with the malignant cells.
This complex and dynamic microenvironment is now recognized to be important both in promoting and inhibiting of tumor growth, invasion, and metastasis [1,2]. Understanding the cellular heterogeneity composing the tumor microenvironment is key for improving existing treatments, the discovery of predictive biomarkers and development of novel therapeutic strategies.
Traditional approaches for dissecting the cellular heterogeneity in liquid tissues are difficult to apply in solid tumors [3]. Therefore, in the last decade, numerous methods have been published for digitally dissecting the tumor microenvironment using gene expression profiles [4][5][6][7] (Reviewed in [8]). Recently, multitudes of studies have been published applying published and novel techniques on publicly available resources of tumor samples such as The Cancer Genome Atlas (TCGA) [6,[9][10][11][12][13]. There are two general types of techniques: deconvolving the complete cellular composition, and assessing enrichments of individual cell types.
There are at least seven major concerns that the in silico methods could be prone to errors, and cannot reliably portray the cellular heterogeneity of the tumor microenvironment. First, current techniques depend on the expression profiles of purified cell types to identify reference genes and therefore rely heavily on the data source of which the references are inferred from, and could be inclined to overfitting to these data.
Second, current methods portray only a very narrow perspective of the tumor microenvironment. The available methods usually focus on a subset of immune cell types, thus not accounting for the further richness of cell types in the microenvironment, including blood vessels and other different forms of cell subsets [14,15]. A third problem is the ability of cancer cells to "imitate" other cell types by expressing immune-specific genes, such as macrophages-like expression pattern in tumors with parainflammation [16]; only a few of the methods take this into account. Fourth, the ability of existing methods to estimate cell abundance have not yet been comprehensively validated in mixed samples. Cytometry is a common method for counting cell types in a mixture, and when performed in combination with gene expression profiling, can allow validation of the estimations. However, in most studies that included cytometry validation, these analyses were performed on only a very limited number of cell types and a limited number of samples [7,13].
A fifth challenge is that deconvolution approaches are prone to many different biases because of the strict dependencies among all cell types that are inferred. This could highly affect reliability in analyzing tumor samples, which are prone to form nonconventional expression profiles. A sixth problem has been raised with inferring an increasing number of closely related cell types [10]. Finally, deconvolution analysis heavily relies on the structure of the reference matrix, which limits its application to the resource used to develop the matrix. One such deconvolution approach is CIBESORT, which is the most comprehensive study to date, allowing the enumeration of 22 immune subsets [7]. Newman et al. performed adequate evaluation across data sources and validated the estimations using cytometry immunophenotyping. However, the shortcomings of deconvolution approaches are apparent in CIBERSORT, which is limited to Affymetrix microarray studies.
On the other hand, gene set enrichment analysis is a simple technique, which can be easily applied across data types and can be quickly applied for cancer studies. Each gene signature is used independently from all other signatures; thus it is protected from the limitations of deconvolution approaches. However, because of this independence, it is many times hard to differentiate between closely related cell types. In addition, gene signature-based methods only provide enrichment scores, and thus do not allow comparison across cell types, and cannot allow insights on the abundance of the cell type in the mixture.
Here, we present xCell, a novel method that integrates the advantages of gene set enrichment with deconvolution approaches. We present a compendium of newly generated gene signatures for 64 cell types, spanning multiple adaptive and innate immunity cells, hematopoietic progenitors, epithelial cells and extracellular matrix cells derived from thousands of expression profiles. Using in silico mixtures, we transform the enrichment scores to a linear scale, and using a spillover compensation technique we reduce dependencies between closely related cell types. We evaluate these adjusted scores in RNA-seq and microarray data from primary cell types samples from various independent sources. We examine their ability to digitally dissect the tumor microenvironment by in silico analyses, and perform the most comprehensive comparison to date with cytometry immunophenotyping. We compare our inferences with available methods and show that scores from xCell are more reliable in digital dissection of mixed tissues. Finally, we apply our method on TCGA tumor samples to portray a full tumor microenvironment landscape across thousands of samples. We provide these estimations to the community and hope that this resource will allow researchers gain a better perspective of the complex cellular heterogeneity in tumor tissues. The xCell pipeline: Using the data source and based on different thresholds we learn gene signatures for 64 cell types. Of this collection of 6,573 signatures we choose 489 most reliable cell types, 3 for each cell types from each data source were it is available. The raw score is then the average ssGSEA score of all signatures corresponding to the cell type. Using simulations of gene expression for each cell type we learn a function to transform the non-linear association between the scores to a linear scale. Using the simulations we also learn the dependencies between cell types scores and apply a spillover compensation method to adjust the scores.   Next, we used single-sample gene set enrichment analysis (ssGSEA) to score each sample based on all signatures. ssGSEA is a well-known method for aggregating a single score of the enrichment of a set of genes in the top of a ranked gene expression profile [23]. To choose the most reliable signatures we tested their performance in identifying the corresponding cell type in each of the data sources. To prevent overfitting, each signature learned from one data source was tested in other sources, but not in the data source it was originally inferred. To reduce biases resulting from a small number of genes and from the analysis of different platforms, instead of one signature per cell type, the top three ranked signatures from each data source were chosen. Altogether we generated 489 gene signatures corresponding to 64 cell types spanning multiple adaptive and innate immunity cells, hematopoietic progenitors, epithelial cells and extracellular matrix cells (Supplementary Table 2). Observing the scores in the 97 test primary cell types samples affirmed their ability to identify the corresponding cell type compared to other cell types across data sources (Supplementary Figure 2). We defined the raw enrichment score per cell type to be the average ssGSEA scores from all the cell type's corresponding signatures.

Spillover compensation between closely related cell types
Our primary objective is to accurately identify enrichments of cell types in mixtures. To imitate such admixtures, we performed an array of simulations of gene expression combinations of cell types to assess the accuracy and sensitivity of our gene signatures. We generated such in silico expression profiles using different data sources; using different sets of cell types in mixtures; and by choosing randomly one sample per cell type from all available samples in the data source. The simulations revealed that our raw scores reliably predict even small changes in the proportions of cell types, distinguish between most cell types, and are reliable in different transcriptomic analysis platforms (Supplementary Figure 3). However, the simulations also revealed that raw scores of RNA-seq samples are not linearly associated with the abundance and that they do not allow comparisons across cell types (Supplementary Figure 4). Thus, using the training samples we generated synthetic expression profiles by mixing the cell type of interest with another non-related cell types. We then fit a formula that transforms the raw scores to cell-type abundances. We found that the transformed scores showed resemblance to the known fractions of the cell types in simulations, thus allowing to compare scores across cell types, and not just across samples (Supplementary Figure 5).
The simulations also revealed another limitation of the raw scores: closely related cell-types tend to have correlating scores (Supplementary Figure 5). That is, scores may show enrichment for a cell type due to a 'spillover effect' between closely related cell types. This problem mimics the spillover problem in flow-cytometry, in which fluorescent signals correlate with each other due to spectrum overlaps. Inspired by the compensations method used in flow-cytometry studies [24], we leveraged our simulations to generate a spillover matrix that allows correcting for correlations between cell types.
To better compensate for low abundances in mixtures we created a simulated dataset where each sample contains 25% of the cell type of interest and the rest from a nonrelated cell type and produced a spillover matrix, a representation of the dependencies of scores between different cell types.
Similar to the performance compared to signatures, xCell also outperformed CIBERSORT enumerations in all comparable cell types, across all data sources ( Figure   2B-C and Supplementary Figure 5-6).

Validation of enrichment scores with cytometry immunoprofilings
In addition to the simulated mixtures analysis, we compared our estimates for cell types enrichments from gene expression profiles with mass spectrometry (CyTOF) immunophenotyping. We utilized independent publicly-available studies, in which a total of 165 individuals were studied for both gene expression from whole blood and FACS across 18 cell subsets from peripheral blood mononuclear cells (PBMC) (available in ImmPort SDY311 and SDY420) [28]. We calculated xCell scores for each of the signatures using the study's expression profiles and correlated the scores with the FACS fractions of the cell subsets. Of the 14 cell types with at least 1% abundance, xCell was able to significantly recover 10 and 12 cell subsets in SDY311 and SDY420 respectively (Pearson correlation between calculated and actual cell counts p-value < 0.05) ( Figure 3).
Comparing the performance of xCell to previously published signatures and CIBERSORT revealed that no other method was able to recover cell types that our method was not able to recover in both data sets ( Figure 3). In general, previous methods were able to recover signal only from major cell types, including B-cells, CD4+ and

Cell types enrichments in tumor samples
We next applied our methodology on 9,947 primary tumor samples across thirty-seven cancer types from the TCGA and TARGET projects [29] (Supplementary Figure 9).    Pearson coefficients microenvironment score as the sum of all immune and stroma cell types. We then correlated this microenvironment score with our previously generated purity estimations, which are based on copy number variations, gene expression, DNA methylation and H&E slides [30]. Our analysis showed highly significant negative correlations in all cancer types, suggesting this score as a novel measurement for tumor microenvironment abundance (Supplementary Figure 10).    Table 4).

Discussion
Recently, many studies have shown different methodologies for digital dissection of cancer samples [3,6,[9][10][11][12][13]. These studies have suggested novel insights in cancer research and related to therapy efficacy. However, it is important to remember that the methods that have been applied for portraying the tumor microenvironment have only retained limited validation, and it is unclear how reliable their estimations are. In this study, we took a step back and focused on generating cell type gene scores that could reliably estimate enrichments of cell types. Our method, which is gene-signature based, is more reliable due to its reliance on a group of signatures for each cell type, learned from multiple data sources, which increases the ability to detect the signal from the noise. Our method also integrates a novel approach to remove dependencies between cell types, which allow better reliability in studying closely related cell types.
To develop xCell, we collected the most comprehensive resource to date of primary cell types, spanning the largest set of human cell types. We then performed an extensive validation of the predicted cell types inferences in mixed samples. Our method for choosing a set of signatures that are reliable across several data sources has proven to be beneficial, as our scores robustly outperformed all available methods in predicting the abundance of cell types in in silico mixtures and blood samples. Based on our evaluation, xCell provides the most accurate and sensitive way to identify enrichments of many cell types in an admixture, allowing the detection of subtle differences in the enrichments of a particular cell type in the tumor microenvironment with high confidence.
It is important to note that xCell, as all other methods, performed significantly better in simulated mixtures than in real mixtures. Here, there are several technical reasons for this discrepancy. First, the cytometry analyses were performed on PBMCs, while the gene expression profiles were generated from the whole blood. Second, not all genes required by xCell were present, in fact in SDY420 only 54.5% of the genes required by xCell were available. However, other explanations for the lower success to infer abundances in real samples are warranting -it may well be possible that the expression patterns of marker genes in mixtures is different than in purified cells. Recent technologies such as single-cell RNA-sequencing may be able to clarify how much this may perturb the analyses.
We chose to apply a gene signature enrichment approach over deconvolution methods because of several advantages that the former provides. First, gene signatures are rank-based, and therefore are suitable for cross-platform transcriptome measurements.
We showed here that our scores reliably predict enrichments in different RNA-seq Another limitation of our method is that the inferences are strictly enrichment scores, and cannot be interpreted as proportions. This is due to the inability to translate the minimal and maximal scores produced by ssGSEA to clear proportions. Thus, while our method attempts to calibrate the scores to resemble proportions, this cannot be reliably used as such. This limitation also does not allow to provide statistical significance for the inferences, by calculating an empirical p-value as suggested by Newman et al. [7].
In summary, tissue dissection methods are an emerging tool for large-scale characterization of the tumor cellular heterogeneity. These approaches do not rely on tissue dissociation, opposed to single-cell techniques, and therefore provide an effective tool for dissecting solid tumors. The immense availability of public gene expression profiles allows these methods to be efficiently performed on hundreds of historical cohorts spanning thousands of patients, and to associate them with clinical outcomes.
Here we presented the most comprehensive collection of gene expression enrichment scores for cell types. Our methodology for generating cell type enrichment scores and adjusting them to cell types proportions allowed us to create a powerful tool that is the most reliable and robust tool currently available for identifying cell types across data sources. We provide a simple web tool to the community and hope that further studies will utilize it for the discovery of novel predictive and prognostic biomarkers, and new therapeutic targets: http://xCell.ucsf.edu/.

Data sources
Signatures data sources: RNA-seq and cap analysis gene expression (CAGE) normalized FPKM were downloaded from the FANTOM5, ENCODE and Blueprint data portals. Raw Affymetrix microarray CEL files were downloaded from the Gene Expression Omnibus (GEO), accessions: GSE22886 (IRIS), GSE24759 (Novershtern) and GSE49910 (HPCA), and analyzed using the Robust Multi-array Average (RMA) procedure on probe-level data using Matlab functions. The analysis was performed using custom CDF files downloaded from Brainarray [32]. All samples were manually annotated to 64 cell types (Supplementary Table 1). Similarly, Agilent Whole Human Genome 4 x 44 K slides data of PBMC samples and the accompanying CyTOF data were downloaded from ImmPort accession SDY420 [33], and quantile normalized using Matlab functions. Multiple probes per gene were collapsed using averages. RNA-seq data of Cancer Cell Line Encyclopedia (CCLE) [22] was obtained using the PharmacoGx R package [34]. RSEM levels 9,947 primary tumor samples from TCGA and TARGET were downloaded from https://toil.xenahubs.net.

In silico simulated mixtures
We generated several types of simulated mixtures, but all are based on the same pipeline: 1) Given a data source of pure cell types, choose n cell types available in the data and choose a random fraction for each cell type (the fractions sum to 1). We denote this vector of fraction .

Filtering cancer genes:
In a previous study [16] we calculated using CCLE the number of cell lines that are over-expressing each gene (2-fold more than the peak of expression distribution  (2), 3,4 and 5. We repeated this procedure to each of the 6 data sources independently.
Only gene sets of at least 8 genes and no more than 200 genes were reserved. This scheme yielded 6,573 gene signatures corresponding to 64 cell types. We calculated single-sample gene set enrichment analysis (ssGSEA) for each of those signatures to score each sample in each of the data sources using the GSVA R package [35].
Choosing the "best" signature: For each signature we computed the t-statistic between be compared with a score from another signature. In addition, in sequencing-based data, the association between the underlying levels of the cell type was not linearly associated with the score. We thus designed a transformation pipeline for the scores (which is applied to both sequencing and microarray-based datasets separately) -for each cell type, using the synthetic mixtures we first shifted the scores to 0 using the minimal score (which corresponded to mixtures containing 0.8% of the cell type) and divided by 5000.
We then fit a power function to the scores ranging corresponding to abundances of 0.8% to 25.6%. We used this range because we are mostly interested in identifying cell types with low abundance, and above that the function exponential increase may interfere in a precise fitting. The power coefficient is then averaged across the data sources were the cell type is available (we denote this vector as ). After adjusting the score using the learned power coefficient, we fit a linear curve, and use the learned slope as a calibration parameter for the adjusted scores (denoted as 1).
Learning the spillover compensation reference matrix: Another limitation that was observed in the mixtures is the dependencies between closely related cell types: scores that predict enrichment of one cell type also predict enrichment of another cell type, which might not even be in the mixture. To overcome this problem we created a reference matrix of 'spillovers' between cell types. Below we focus on the generation of the sequencing-based spillover matrix, however equivalent process was performed to generate the microarray-based spillover matrix. We first generated a synthetic mixture set, where each mixture contains 25% of each of the cell types (median expression) and 75% of a 'control' cell type, as in the previous section. We then calculated raw cell types scores and transformed them using the learned coefficients as explained above. We combined all sequencing-based data sources together by using the average scores, and completed the matrix to be 64x64 by adding columns from cell types that are not present in any of the sequencing-based data using the microarray reference matrix. We then normalized each row of cell types scores by dividing it by the diagonal (denoted asspillover matrix, rows are cell types scores and columns are cell type samples). The diagonal, before the normalization, is also used for calibration (denoted as 2) The 'spillover' between a cell type score (x) and another cell type (y) is the ratio between x and y. Finally, we clean the spillover matrix to not compensate between parent and descendent cell types by compensating parent cell types only with other parent cell types (CD4+ T-cells are compensated against CD8+ T-cells, but not CD8+ Tem), and compensating child cell types only compared to other child cell types from the same parent and all other parents, but not child cell types from other parents. Some of the compensations where too strong, removing correlations between cell types and their corresponding signatures, thus we limit the compensation levels, off the diagonal, to 0.5.
The spillover matrix, power and calibration coefficients are available in Supplementary   Table 5.
Calculating scores for a mixture: Input: Gene expression data set normalized to gene length (such as FPKM or TPM), rows are genes and columns are samples (N -number of samples). Duplicate gene names are combined together. xCell uses a set of 10,808 genes for the scoring. It is recommended to use data sets that contain at least the majority of these genes. Missing values in a sample are treated as missing genes. It is also recommended to use as many samples as possible, transformed using the following formula: The output is matrix T of transformed scores. Different P, V1 and V2 are used for sequencing-based and microarray-based datasets. (4) Spillover compensation is then performed for each row using linear least squares that minimizes the following (as performed in flow cytometry analyses and explained in Bagwell et al. [24]): • − ! , such that ≥ 0 All x's are then combined to create the final xCell scores. The compensation may result in deteriorating real associations, thus we provide a scaling parameter (alpha), to multiply all off-diagonal cells in matrix K. In all experiments in the manuscript we used alpha=0.5. Different K matrices are used for sequencing-based and array-based data.

Cytometry analyses
Gene expression and cytometry data were downloaded from ImmPort (SDY311 and SDY420). The gene expression data was quantile normalized using Matlab functions, and multiple probes per gene were collapsed using averages. The cytometry data counts were divided by the viable/singlets counts. In the SDY311 dataset 10 patients had two replicates of expression profiles, and those were averaged. Two outlier samples in the cytometry data set were removed from further analyses (SUB134240, SUB134283).

Other tools
The CIBERSORT web tool was used for inferring proportions using the expression profile (http://cibersort.stanford.edu). CIBERSORT results of activated and resting cell types were combined, B-cell and CD4+ T-cells percentages are the combination of all their subtypes. t-SNE plots were produced using the Rtsne R package. Purity measurements were obtained from our previous publication [30]. Correlation plots were generated using the corrplot R package.

Availability of Data and Materials
The xCell R package for generating the cell type scores, and R scripts for the development of xCell are available at https://github.com/dviraran/xCell, and deposited to Zenodo (assigned DOI:10.5281/zenodo.800745).