A Bayesian Framework for Estimating Cell Type Composition from DNA Methylation Without the Need for Methylation Reference

Genome-wide DNA methylation levels measured from a target tissue across a population have become ubiquitous over the last few years, as methylation status is suggested to hold great potential for better understanding the role of epigenetics. Different cell types are known to have different methylation profiles. Therefore, in the common scenario where methylation levels are collected from heterogeneous sources such as blood, convoluted signals are formed according to the cell type composition of the samples. Knowledge of the cell type proportions is important for statistical analysis, and it may provide novel biological insights and contribute to our understanding of disease biology. Since high resolution cell counting is costly and often logistically impractical to obtain in large studies, targeted methods that are inexpensive and practical for estimating cell proportions are needed. Although a supervised approach has been shown to provide reasonable estimates of cell proportions, this approach leverages scarce reference methylation data from sorted cells, which are not available for most tissues and may not be appropriate for every target population. Here, we introduce BayesCCE, a Bayesian semi-supervised method that leverages prior knowledge on the cell type composition distribution in the studied tissue. As we demonstrate, such prior information is substantially easier to obtain compared to appropriate reference methylation levels from sorted cells. Using real and simulated data, we show that our proposed method is able to construct a set of components, each corresponding to a single cell type, which together provide up to 50% improvement in correlation when compared with existing reference-free methods.
We further make a design suggestion for future data collection efforts by showing that results can be further improved using cell count measurements for a small subset of individuals in the study sample or by incorporating external data of individuals with measured cell counts. Our approach provides a new opportunity to investigate cell compositions in genomic studies of tissues for which it was not possible before.


Introduction
Epigenome-Wide Association Studies (EWAS), where genome-wide methylation levels are measured across a population and compared to a phenotype of interest, have become ubiquitous over the last few years. Many associations between methylation sites and disease status have been reported (e.g., multiple sclerosis [1], schizophrenia [2], and type 2 diabetes [3]), suggesting an important role for DNA methylation in complex diseases. Thus, DNA methylation status holds great potential for better understanding the role of epigenetics, potentially leading to better clinical tools for diagnosing and treating patients.
In a typical EWAS, we obtain a large matrix in which each entry corresponds to the methylation level (a number between 0 and 1) at a specific genomic position for a specific individual. This level represents the fraction of the probed DNA molecules that were found to have an additional methyl group at the specific position for the specific individual. In such studies, we typically search for rows of the methylation matrix (each corresponding to one genomic position) that are significantly correlated with a phenotype of interest. The analysis of EWAS is complicated by the fact that the studied tissue is typically a mixture of different cell types. Since each cell type may have a distinct methylation pattern, the resulting DNA methylation matrix is a convolution of the signals arising from the different cell types. As a result, a large number of false discoveries may be found in the common case where the cell type composition is correlated with the phenotype [4].
In principle, one can avoid false discoveries by adding high-resolution cell counts to a regression model commonly used in an EWAS. Unfortunately, such cell counting for a large cohort may be costly and often logistically impractical (e.g., in some tissues, such as blood, reliable cell counting can be obtained from fresh samples only). In order to overcome this problem and to allow correcting methylation data for cell type composition, several statistical and computational methods have been proposed [5][6][7][8][9]. These methods take either a supervised approach, in which reference data of methylation patterns from sorted cells are obtained and used for predicting cell compositions [5], or an unsupervised approach (reference-free) [6][7][8][9].
The main advantage of the reference-based method is that it provides direct (absolute) estimates of the cell counts, while current unsupervised methods are only capable of inferring components that capture linear combinations of the cell counts. However, the reference-based method can only be applied when relevant reference data exist. Currently, reference data only exist for blood [10], breast [11] and brain [12], for a small number of individuals (e.g., six samples in the blood reference [10]). In addition, the individuals in most data sets do not match the reference individuals in their methylation-altering factors such as age [13] and sex [14,15]. This problem was recently highlighted in a study showing that available blood reference collected from adults fails to estimate cell proportions of newborns [16]. It is therefore often the case that unsupervised methods are either the only option or are a better option for the analysis of EWAS.
In contrast to the supervised approach, reference-free methods can in principle be applied to any tissue, but they do not provide direct estimates of the cell type proportions. A few reference-free methods allow us to infer a set of components, or general axes, which were shown to be linearly correlated with cell type composition [8,9]. While such linearly correlated components are useful in linear analyses such as linear regression, they cannot, unlike cell proportions, be used in any nonlinear downstream analysis (e.g., when studying specific cell types). Cell proportions may provide novel biological insights and contribute to our understanding of disease biology, and we therefore need targeted methods that are practical and low in cost.
Here, we propose an alternative strategy that utilizes prior knowledge about cell counts to improve upon the performance of reference-free methods, while addressing some of their limitations. We present a Bayesian semi-supervised method, BayesCCE (Bayesian Cell Count Estimation), which encodes experimentally obtained cell count information as a prior on the distribution of the cell type composition in the data. As we demonstrate here, the required prior is substantially easier to obtain compared with standard reference data from sorted cells. We can estimate this prior from general cell counts collected in previous studies, without the need for corresponding methylation data or any other genomic data.
We evaluate our method on two large data sets and on simulated data, and show that our method produces a set of components that can be used as cell composition estimates. We observe that each component of BayesCCE can be regarded as corresponding to a linear transformation of exactly one cell type (i.e. high absolute correlation with one cell type, but not necessarily good estimates in absolute terms). Considering existing reference-free methods as a baseline for estimating cell proportions, we find that BayesCCE provides improvement of up to 50% in correlation. We also consider the case where both methylation and cell count information are available for a small subset of the individuals in the sample, or for a group of individuals from external data. We show that our proposed Bayesian model can leverage such additional information, and we demonstrate that it allows us to impute missing cell counts in absolute terms. Testing this case on both real and simulated data, we find that measuring cell counts for a small group of samples (a couple of dozen) can lead to a further increase in the correlation of BayesCCE's components with the cell type composition. We therefore propose that future studies consider measuring cell counts for at least a small number of the samples in the study, if possible, or incorporate into their analysis external data of samples with both methylation and measured cell counts from the same tissue.

Model
Let $O \in \mathbb{R}^{m \times n}$ be an $m$ sites by $n$ samples matrix of DNA methylation levels coming from a heterogeneous source consisting of $k$ cell types. For methylation levels, we consider what is commonly referred to as beta-normalized methylation levels, which are defined for each sample at each site as the proportion of methylated probes out of the total number of probes. Put differently, $O_{ji} \in [0, 1]$ for each site $j$ and sample $i$. We denote $M \in \mathbb{R}^{m \times k}$ as the cell type specific mean methylation levels for each site, and denote a row of this matrix, corresponding to the $j$th site, using $M_{j,\cdot}$. We denote $R \in \mathbb{R}^{n \times k}$ as the cell type proportions of the samples in the data, and denote $X \in \mathbb{R}^{n \times p}$ as a matrix of $p$ covariates for each individual and a $p$-length row vector $\beta_j$ as their corresponding effects in the $j$th site. If the measurements of $O_{ji}$ were the true values of the methylation levels, then $O_{ji} = M_{j,\cdot} R_i^T + \beta_j X_i^T$. Due to measurement noise and other unmodeled factors, we incorporate an error term $\epsilon_{ji}$. Thus, the full model for the observed methylation levels is

$$O_{ji} = M_{j,\cdot} R_i^T + \beta_j X_i^T + \epsilon_{ji} \tag{1}$$
$$\epsilon_{ji} \sim N(0, \sigma^2) \tag{2}$$
$$\text{s.t.} \quad \forall i, h: \; R_{ih} \geq 0 \tag{3}$$
$$\forall i: \; \sum_{h=1}^{k} R_{ih} = 1 \tag{4}$$
$$\forall j, h: \; 0 \leq M_{jh} \leq 1 \tag{5}$$

The constraints in (3) and in (4) require the cell proportions to be nonnegative and to sum up to one in each sample, and the constraints in (5) require the cell type specific mean levels to be in the range [0, 1]. We note that the above formulation of the problem is similar to the one previously suggested in the context of reference-based estimation of cell proportions from DNA methylation by Houseman et al. [5]. The reference-based method first obtains estimates of $M$ from reference methylation data collected from sorted cells of the cell types composing the studied tissue. Once $M$ is known, $R$ can be estimated by solving a standard quadratic program. If the matrix $M$ is not known, which is a reference-free version of the problem, the above formulation can be regarded as a version of the nonnegative matrix factorization (NNMF) problem.
NNMF has been suggested in several applications in biology; notably, the problem of inference of cell type composition from methylation data has been recently formulated as an NNMF problem [9]. In order to optimize the model, the authors use an alternating optimization procedure in which M or R is optimized while the other is kept fixed. However, as demonstrated by the authors [9], their version of NNMF results in the inference of a linear combination of the cell proportions R. Put differently, more than one component of the NNMF is required for explaining each cell type in the data. Another recent reference-free method for estimating cell composition in methylation data, ReFACTor [8], performs a feature selection step followed by a principal components analysis (PCA). Similarly to the NNMF solution, ReFACTor is an unsupervised method and it only finds principal components (PCs) that form linear combinations of the cell proportions rather than directly estimating the cell proportion values [8].
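Here, we suggest a more detailed model by adding a prior on $R$ and taking into account potential covariates. Specifically, we assume that

$$R_i \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_k) \tag{6}$$

where $\alpha_1, \ldots, \alpha_k$ are assumed to be known. In practice, the parameters are estimated from external data in which cell type proportions of the studied tissue are known. Such experimentally obtained cell type proportions were used to test the appropriateness of the Dirichlet prior in describing cell composition distribution (data not shown). We are interested in estimating $R$. Deriving a maximum likelihood-based solution for this model and repeating the constraints for completeness results in the following optimization problem:

$$\hat{M}, \hat{R}, \hat{\beta} = \arg\min_{M, R, \beta} \; \frac{1}{2\sigma^2} \sum_{j=1}^{m} \sum_{i=1}^{n} \left( O_{ji} - M_{j,\cdot} R_i^T - \beta_j X_i^T \right)^2 - \sum_{h=1}^{k} (\alpha_h - 1) \sum_{i=1}^{n} \log(R_{ih}) \tag{7}$$
$$\text{s.t.} \quad \forall i, h: \; R_{ih} \geq 0 \tag{8}$$
$$\forall i: \; \sum_{h=1}^{k} R_{ih} = 1 \tag{9}$$
$$\forall j, h: \; 0 \leq M_{jh} \leq 1 \tag{10}$$

Our intuition in this model is that since the priors on $R$ are estimated from real data, incorporating them will push the solution of the optimization to return estimates of $R$ which are closer to the true values as opposed to a linear combination of them.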
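As an illustration of how the Gaussian error term and the Dirichlet prior combine into a single objective, the following minimal sketch evaluates the squared-error term scaled by 1/(2σ²) minus the Dirichlet log-prior term for given candidate values. It is not the authors' Matlab implementation; the function name `objective` and the nested-list matrix layout are assumptions made for this example, and no constraints are enforced here.

```python
import math

def objective(O, M, R, X, beta, alpha, sigma2):
    """Penalized least-squares objective (hypothetical helper, illustration only).

    O: m x n observed levels, M: m x k cell-type means, R: n x k proportions,
    X: n x p covariates, beta: m x p covariate effects, alpha: k Dirichlet
    parameters. All matrices are nested lists; every R[i][h] must be > 0.
    """
    m, n, k = len(O), len(R), len(alpha)
    # Residual sum of squares of O_ji - M_j . R_i - beta_j . X_i
    rss = 0.0
    for j in range(m):
        for i in range(n):
            fitted = sum(M[j][h] * R[i][h] for h in range(k))
            fitted += sum(b * x for b, x in zip(beta[j], X[i]))
            rss += (O[j][i] - fitted) ** 2
    # Dirichlet term: sum_h (alpha_h - 1) * sum_i log(R_ih)
    log_prior = sum(
        (alpha[h] - 1.0) * sum(math.log(R[i][h]) for i in range(n))
        for h in range(k)
    )
    return rss / (2.0 * sigma2) - log_prior
```

With all α values equal to one the prior term vanishes and the objective reduces to ordinary least squares, which is one way to see that the prior is what distinguishes this formulation from plain NNMF.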

Algorithm
Our algorithm uses ReFACTor as a starting point. Specifically, we use ReFACTor's PCs (ReFACTor components) in order to estimate $R$ by finding an appropriate linear transformation of the ReFACTor components. In principle, both ReFACTor and NNMF could be used as the starting point for our method. However, we found that ReFACTor captures a larger portion of the cell composition variance compared with the NNMF solution (see Results). Applying ReFACTor on our input matrix $O$ we get a list of $t$ sites that are most informative with respect to the cell composition in $O$. Let $\tilde{O} \in \mathbb{R}^{t \times n}$ be a truncated version of $O$ containing only the $t$ sites selected by ReFACTor. We apply PCA on $\tilde{O}$ to get $L \in \mathbb{R}^{t \times d}$, $P \in \mathbb{R}^{n \times d}$, the loadings and scores of the first $d$ ReFACTor components. Then, we reformulate the original optimization problem in terms of linear transformations of $L$ and $P$ as follows:

$$\hat{A}, \hat{V}, \hat{B} = \arg\min_{A, V, B} \; \frac{1}{2\sigma^2} \left\| \tilde{O} - L A V^T P^T - L B X^T \right\|_F^2 - \sum_{h=1}^{k} (\alpha_h - 1) \sum_{i=1}^{n} \log\left( (PV)_{ih} \right) \tag{11}$$
$$\text{s.t.} \quad \forall i, h: \; (PV)_{ih} \geq 0 \tag{12}$$
$$\forall i: \; \sum_{h=1}^{k} (PV)_{ih} = 1 \tag{13}$$
$$\forall j, h: \; 0 \leq (LA)_{jh} \leq 1 \tag{14}$$

where $\|\cdot\|_F^2$ is the squared Frobenius norm, $A \in \mathbb{R}^{d \times k}$ is a transformation matrix such that $\tilde{M} = LA$ ($\tilde{M}$ being a truncated version of $M$ with the $t$ sites selected by ReFACTor), $V \in \mathbb{R}^{d \times k}$ is a transformation matrix such that $R = PV$ and $B \in \mathbb{R}^{d \times p}$ is a transformation matrix such that $LB$ corresponds to the effects of each covariate on the methylation levels in each site. The constraints in (12) and in (13) correspond to the constraints in (8) and in (9), and the constraints in (14) correspond to the constraints in (10).
Given $\hat{V}$, we simply return $\hat{R} = P\hat{V}$ as the estimated cell proportions. Note that in the new formulation we are now required to learn only $d(2k + p)$ parameters ($d$, $k$ and $p$ being small constants), a dramatically decreased number of parameters compared with the original problem, which requires $nk + m(k + p)$ parameters. By taking that approach, we make an assumption that $\tilde{O}$ consists of a low rank structure that captures the cell composition using $d$ orthogonal vectors. While a natural value for $d$ would be $d = k$, $d$ is not bound to be $k$. Particularly, in cases where substantial additional cell composition signal is expected to be captured by later ReFACTor components (i.e. components beyond the first $k$), we would expect to benefit from increasing $d$. Clearly, overly increasing $d$ is expected to result in overfitting and thus a decrease in performance. Finally, taking into account covariates with potentially dominant effects in the data should alleviate the risk of introducing noise into $\hat{R}$ in case of mixed low rank structure of cell composition signal and other unwanted variation in the data. We note, however, that similarly to the case of correlated explanatory variables in regression, considering covariates that are expected to be correlated with the cell type composition may result in underestimation of $A$, $V$ and therefore in a decrease in the quality of $\hat{R}$.
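To make the reduction in the number of parameters concrete, the short sketch below evaluates both counts. The function names are hypothetical, and the dimensions are assumptions chosen only for illustration (in particular, the number of sites m and the number of covariates p are not taken from any data set in this paper).

```python
def full_parameter_count(n, m, k, p):
    """Parameters of the original problem: R (n x k), M (m x k), beta (m x p)."""
    return n * k + m * (k + p)

def reduced_parameter_count(d, k, p):
    """Parameters of the reformulated problem: A (d x k), V (d x k), B (d x p)."""
    return d * (2 * k + p)

# Hypothetical study dimensions, chosen only for illustration.
n, m, k, p, d = 650, 450_000, 6, 4, 10
print(full_parameter_count(n, m, k, p), reduced_parameter_count(d, k, p))
```

Under these assumed dimensions the original problem has 4,503,900 free parameters, whereas the reformulated problem has 160.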

Imputing cell counts using a subset of samples with measured cell counts
In practice, we observe that each of BayesCCE's components corresponds to a linear transformation of one cell type rather than to an estimate of that cell type in absolute terms. That is, it still lacks the right scaling (multiplication by a constant and addition of a constant) for transforming it into cell proportions. Furthermore, we would like the $i$th BayesCCE component to correspond to the $i$th cell type described by the prior using the $\alpha_i$ parameter. Empirically, that is not necessarily the case, especially in scenarios where some of the $\alpha_i$ values are similar. In order to address these two caveats, we suggest incorporating measured cell counts for a subset of the samples in the data. Assume we have $n_0$ reference samples in the data with known cell counts $R^{(0)}$ and $n_1$ samples with unknown cell counts $R^{(1)}$ ($n = n_0 + n_1$). This problem can be regarded as an imputation problem, in which we aim at imputing cell counts for samples with unknown cell counts. We can find $\hat{M}$ by solving the problem in (7) under the constraints in (10) for the $n_0$ reference samples while replacing $R$ with $R^{(0)}$ and keeping it fixed. Then, given $\hat{M}$, we can now solve the problem in (11), after replacing $LA$ with $\hat{M}$ (i.e. we find only $V$, $B$ now), under the following constraints

$$P^{(0)} V = R^{(0)} \tag{15}$$
$$\forall i, h: \; (P^{(1)} V)_{ih} \geq 0 \tag{16}$$
$$\forall i: \; \sum_{h=1}^{k} (P^{(1)} V)_{ih} = 1 \tag{17}$$

where $P^{(0)}$ contains $n_0$ rows corresponding to the reference samples in $P$, and $P^{(1)}$ contains $n_1$ rows corresponding to the remaining samples in $P$. In this case, both problems of estimating $M$ and solving (11) while keeping $\hat{M}$ fixed are convex: the first problem takes the form of a standard quadratic program and the latter results in an optimization problem of the sum of two convex terms under linear constraints. Using $\hat{M}$, estimated from cell counts and corresponding methylation levels of a group of samples, as well as adding the constraints in (15), is expected to direct the inference of $R$ towards a set of components such that each one corresponds to one known cell type with a proper scale.
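The first of the two convex problems can be illustrated for a single site. The sketch below is a simplification under stated assumptions: it ignores covariates, treats the measured proportions of the reference samples as fixed, and uses a plain projected-gradient loop in place of a quadratic programming solver. The function name `fit_site_means` is hypothetical and this is not the authors' implementation.

```python
def fit_site_means(o, R0, n_iter=500):
    """Estimate the k cell-type-specific mean levels of one site.

    o:  methylation levels of the site in the n0 reference samples.
    R0: n0 x k measured cell proportions (kept fixed).
    Minimizes sum_i (o_i - R0_i . m)^2 subject to 0 <= m_h <= 1.
    """
    n0, k = len(R0), len(R0[0])
    # The gradient's Lipschitz constant is 2 * lambda_max(R0^T R0), which is
    # at most 2 * trace(R0^T R0), so this step size is safe.
    step = 1.0 / (2.0 * sum(r * r for row in R0 for r in row))
    m = [0.5] * k
    for _ in range(n_iter):
        resid = [sum(R0[i][h] * m[h] for h in range(k)) - o[i] for i in range(n0)]
        grad = [2.0 * sum(R0[i][h] * resid[i] for i in range(n0)) for h in range(k)]
        # Gradient step followed by projection onto the box [0, 1]^k.
        m = [min(1.0, max(0.0, m[h] - step * grad[h])) for h in range(k)]
    return m
```

Because R0 is fixed, the Dirichlet term is constant and each site can be fitted independently, which is why this step reduces to a bound-constrained least-squares problem.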

Implementation and practical issues
We estimate $\sigma^2$ in (11) as the mean squared error of predicting $\tilde{O}$ with $P$ and $X$. The $\alpha_1, \ldots, \alpha_k$ Dirichlet parameters of the prior can be estimated from cell proportions using maximum likelihood estimators. In practice, we add a column of ones to both $L$ and $P$ in (11) in order to ensure feasibility of the problem; these constant columns are used to compose the mean methylation level per site across all cell types and the mean cell proportion fraction in each cell type across all samples. In addition, we slightly relax some of the constraints in the problem to avoid problems due to numeric instability and inconsistent noise issues. First, we do not impose the equality constraints in (13) and in (17) but rather allow a small deviation from equality (5%). In addition, the inequality constraints in (12) and in (16) are changed to require the cell proportions to be greater than $\epsilon > 0$, as a result of the logarithm term in the objective ($\epsilon = 0.0001$). Finally, given cell counts for a subset of the samples, we allow a small deviation from the equality constraints in (15) due to expected inaccuracies of cell count measurements (1%).
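For readers who want a quick way to obtain Dirichlet parameters from a table of cell proportions, the sketch below uses moment matching. This is a simpler stand-in and not the maximum likelihood estimator used in this work; the function names are hypothetical, and the sampler is included only to generate example input.

```python
import random

def dirichlet_moment_estimate(proportions):
    """Method-of-moments estimate of Dirichlet parameters from rows summing to one."""
    n, k = len(proportions), len(proportions[0])
    means = [sum(row[h] for row in proportions) / n for h in range(k)]
    variances = [
        sum((row[h] - means[h]) ** 2 for row in proportions) / (n - 1)
        for h in range(k)
    ]
    # For a Dirichlet, Var[R_h] = mean_h * (1 - mean_h) / (alpha_0 + 1),
    # so each cell type gives one estimate of the precision alpha_0 = sum(alpha).
    precisions = [means[h] * (1.0 - means[h]) / variances[h] - 1.0 for h in range(k)]
    alpha0 = sum(precisions) / k
    return [alpha0 * mu for mu in means]

def sample_dirichlet(alpha, rng):
    """Draw one composition by normalizing independent Gamma(alpha_h, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]
```

A moment-based estimate of this kind could also serve as a starting point for an iterative maximum likelihood fit.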
We performed all the experiments in this paper using a Matlab implementation of BayesCCE. Specifically, we solved the optimization problems in BayesCCE using the fmincon function with the default interior-point algorithm, and we used the fastfit [17] Matlab package for calculating maximum likelihood estimates of the Dirichlet priors. All executions of BayesCCE required several minutes on a 64-bit Mac OS X computer with 3.1GHz and 16GB of RAM. Corresponding code is available at: https://github.com/cozygene/bayescce.

Evaluation of performance
The fraction of cell composition variation ($R^2$) captured by each of the reference-free methods, ReFACTor and NNMF, was computed for each cell type using a linear predictor fitted with the first k components provided by the method. In order to evaluate the performance of BayesCCE, for each component i we calculated its correlation with the ith cell type, and reported the mean absolute correlation (MAC) across the k estimated cell types. Empirically, we observed that in the case of k = 6 with no known cell counts for a subset of samples, the ith BayesCCE component did not necessarily correspond to the ith cell type. Put differently, the labels of the k cell types had to be permuted before calculating the MAC. In this case we considered the permutation of the labels which resulted in the highest MAC as the correct permutation. In the rest of the cases, we did not apply such permutation (all the experiments using k = 3 and all the experiments using k = 6 with known cell counts for a subset of the samples). For evaluating ReFACTor and NNMF, reference-free methods which do not attribute their components to specific cell types in any scenario, we considered the permutation leading to the highest MAC in all experiments when compared with BayesCCE. In addition, we considered the mean absolute error (MAE) as an additional quality measurement. When calculating absolute errors for the ReFACTor components, we scaled each component to be in the range [0, 1]. Finally, in experiments where cell counts were assumed to be known for a subset of the samples, MAC and MAE were calculated using only the samples for which cell counts were assumed to be unknown.
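The two quality measures, including the search over label permutations, can be sketched as follows. This is a minimal illustration with hypothetical function names; components and cell types are stored as lists of columns, and the exhaustive search is only practical for small k such as the k = 6 used here.

```python
from itertools import permutations
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_absolute_correlation(components, truth):
    """MAC when the i-th component is matched to the i-th cell type."""
    k = len(truth)
    return sum(abs(pearson(components[h], truth[h])) for h in range(k)) / k

def best_permutation_mac(components, truth):
    """Assignment of components to cell types with the highest MAC."""
    k = len(truth)
    best = max(
        permutations(range(k)),
        key=lambda perm: mean_absolute_correlation([components[c] for c in perm], truth),
    )
    return best, mean_absolute_correlation([components[c] for c in best], truth)

def mean_absolute_error(estimates, truth):
    """MAE over all cell types and samples."""
    errors = [abs(e - t) for est, tru in zip(estimates, truth) for e, t in zip(est, tru)]
    return sum(errors) / len(errors)
```

Because the MAC uses absolute correlations, a component that is a negated or rescaled version of a cell type still scores highly, whereas the MAE penalizes any deviation in scale.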

Implementation of ReFACTor and NNMF
The ReFACTor components were calculated for each data set using the parameters k = 6 and t = 500 and according to the implementation of ReFACTor described at http://glint-epigenetics.readthedocs.io, while accounting for known covariates in each data set. More specifically, in the Liu et al. data [18] we accounted for age, sex, smoking status and batch information, and in the Hannum et al. data [19] we accounted for age, sex, ethnicity and batch information. We used the first six ReFACTor components (d = 6) for simulated data in order to match the number of simulated cell types, and the first ten components (d = 10) for real data, as real data are typically more complex and are therefore more likely to contain substantial signal in later components. The NNMF components were computed for each data set using the RefFreeEWAS R package from the subset of 10,000 most variable sites in the data set, as performed in the NNMF paper by the authors [9].

Implementation of the reference-based algorithm
We implemented the reference-based algorithm according to Houseman et al. [5], using 300 highly informative methylation sites defined in a recent study [20] and using reference data collected from sorted blood cells [10].

Data sets
We evaluated the performance of BayesCCE using three data sets collected with the Illumina 450K DNA methylation array. All three data sets are publicly available and preprocessed versions of the data were downloaded from the Gene Expression Omnibus (GEO) database. The first data set (accession GSE42861) was studied in a recent association study of DNA methylation with rheumatoid arthritis (RA) by Liu et al. (n = 686) [18]. The second data set (accession GSE40279) was originally used in a study of aging rates by Hannum et al. (n = 656) [19]. In addition, we used a reference data set of sorted cell types collected in six individuals from whole blood tissue (accession GSE35069) [10]. The latter was used for generating simulated data sets and for estimating the cell type specific mean levels in the implementation of the reference-based algorithm. We excluded from each data set sites coming from the sex chromosomes, as well as polymorphic and cross-reactive sites, as was previously reported [21]. Two samples in the Hannum et al. data were detected as outliers by PCA and were therefore excluded. When running BayesCCE on the data sets by Liu et al. and Hannum et al. we considered known batch information in the analysis.

Data simulation
We simulated data following a model that was previously described in detail elsewhere [8]. Briefly, we used methylation levels from sorted blood cells [10] and, assuming normality, estimated maximum likelihood parameters for each site in each cell type. Cell type specific DNA methylation data were then generated for each simulated individual from normal distributions with the estimated parameters, conditional on the range [0,1], for six cell types and for each site. Cell proportions for each individual were generated using a Dirichlet distribution. The parameters for the Dirichlet were fitted using the cell proportions estimated for the individuals in the Liu et al. [18] and Hannum et al. [19] data sets using the reference-based method [5]. Finally, observed DNA methylation levels were composed from the cell type specific methylation levels and cell proportions for each individual, and random normal noise was added to every data entry to simulate technical noise (σ = 0.01).
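The simulation steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the simulation code used in this work: the function names are hypothetical, the per-site means and standard deviations are supplied by the caller instead of being fitted to the sorted-cell reference data, and truncation to [0, 1] is done by simple rejection sampling.

```python
import random

def truncated_normal(mu, sd, rng):
    """Draw from N(mu, sd^2) conditional on [0, 1] by rejection sampling."""
    while True:
        x = rng.gauss(mu, sd)
        if 0.0 <= x <= 1.0:
            return x

def simulate(site_means, site_sds, alpha, n, noise_sd=0.01, seed=0):
    """Return (O, R): m x n observed levels and n x k cell proportions.

    site_means, site_sds: m x k per-site, per-cell-type normal parameters.
    alpha: k Dirichlet parameters for the cell proportions.
    """
    rng = random.Random(seed)
    m, k = len(site_means), len(alpha)
    R, O = [], [[0.0] * n for _ in range(m)]
    for i in range(n):
        # Cell proportions from a Dirichlet (normalized Gamma draws).
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(g)
        props = [v / total for v in g]
        R.append(props)
        for j in range(m):
            # Individual-specific methylation level of each cell type at site j.
            z = [truncated_normal(site_means[j][h], site_sds[j][h], rng) for h in range(k)]
            mixed = sum(z[h] * props[h] for h in range(k))
            O[j][i] = mixed + rng.gauss(0.0, noise_sd)  # technical noise
    return O, R
```

Note that the added noise can push an observed value slightly outside [0, 1]; whether to clip such values is a design choice that this sketch leaves to the caller.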

Benchmarking existing reference-free methods
We first demonstrate that existing reference-free methods can estimate components that are correlated with the tissue composition in methylation data collected from heterogeneous sources. For the experiments in this paper, we used the whole-blood data set by Liu et al. [18] (n = 686) and the whole-blood data set by Hannum et al. [19] (n = 654; see Methods). In addition, we simulated data based on a reference data set of methylation levels from sorted leukocyte cells [10] (see Methods). While cell proportions were known for each sample in the simulated data, cell counts were not available for the two real data sets. We therefore estimated the cell type composition of six major blood cell types (granulocytes, monocytes and four subtypes of lymphocytes) using a reference-based method [5], which was shown to reasonably estimate leukocyte cell proportions from whole blood methylation data collected from adult individuals [22,16,20]. Due to the absence of large publicly available data with measured cell counts, these estimates were considered as the ground truth for evaluating the performance of the different methods.
We considered two reference-free methods, ReFACTor [8] and NNMF [9], both of which generate components that were shown to capture cell type composition information in methylation data. We evaluated the first six components of ReFACTor and the six components provided by NNMF (six being the number of estimated cell types composing the ground truth). We found both methods to capture a large portion of the cell composition in all data sets; in particular, we observed that ReFACTor performed considerably better than NNMF in all data sets (Figure 1 a-c). Yet, in spite of the fact that both ReFACTor and NNMF capture a large portion of the cell composition variance, each component provided by these methods is a linear combination of the cell types in the data rather than an estimate of the proportions of a single cell type. As a result, as we show in the following experiments, both methods, in general, perform poorly when their components are considered as estimates of cell proportions.

Evaluation of BayesCCE on real and simulated data
We evaluated BayesCCE under various scenarios. The results of the experiments described hereafter are summarized in Figure 1 d-h. In the first and most common scenario, we assume that no appropriate reference methylation data of sorted cells exist for the studied tissue, but we do have information about the distribution of the cell composition in the studied tissue. Such information can be inferred from cell counts collected in previous studies of the same tissue (without the need for any additional genomic data). This information can then be used by BayesCCE for tuning the prior required for the model (see Methods). In order to demonstrate this, we used cell counts collected from 35 healthy adults in a recent study [23]. These cell counts measured levels of lymphocytes, monocytes and three subtypes of granulocytes. Since our ground truth, compiled using the reference-based method, contained only the total granulocyte levels, we collapsed the three subtypes of granulocytes into a total measurement of granulocytes.
We applied BayesCCE on the real data sets under the assumption that three cell types compose the data (k = 3). Since each component of BayesCCE is expected to correspond to a linear transformation of one cell type, we report absolute linear correlations (see Methods). BayesCCE provided excellent estimates of the levels of granulocytes and lymphocytes in both data sets (r = 0.96 and r = 0.98 in the Liu et al. data, and r = 0.94 and r = 0.98 in the Hannum et al. data; see Figure 2). In contrast, we observed poor estimates of the monocyte levels (r = 0.14 in the Liu et al. data and r = 0.26 in the Hannum et al. data). We note that poor performance in capturing some cell types may be partially driven by inaccuracies introduced by the reference-based estimates, which are used as the ground truth in our experiments. For example, several recent studies that included a relatively large number of samples for which both methylation levels and cell count measurements were available demonstrated that, while the reference-based estimates of the overall lymphocyte and granulocyte levels were highly correlated with the true levels, the accuracy of the estimates of monocytes was substantially lower [16,8,24]. Such inaccuracies in estimating a specific cell type by the reference-based approach may be the result of utilizing an inappropriate reference. More specifically, cell types with highly variable methylation patterns across different populations may not be well represented for the target population by an existing reference (coming from a specific population). Another possible driver of low quality estimates is having cell types with methylation profiles that do not distinguish them well enough from other cell types in the tissue, or failing to select a set of informative features that mark some of the cell types. As a second validation of our method, we used the reference-based estimates of the six cell types for learning the prior.
For each one of the two real data sets, we used the cell proportion estimates of the other data set for learning the prior. We then applied BayesCCE on each data set under the assumption of six cell types (k = 6) and measured the correlation with the reference-based estimates. The mean absolute correlation across the six cell types was found to be 0.57 in the Liu et al. data and 0.56 in the Hannum et al. data (Figure 3). In addition to the real data analysis, we further conducted a similar experiment on simulated data (n = 650). In this case, we estimated the prior from a group of 50 samples that were generated from the true distribution. We applied BayesCCE on ten different simulated data sets, and found the mean absolute correlation across all cell types and across all the simulated data sets to be 0.62. As expected, applying BayesCCE on increased sample size resulted in an improved performance (Supplementary Figure S1).

Evaluation of cell count imputation
Next, we considered the scenario in which cell counts are known for a small subset of the samples in the data. This problem can be viewed as an imputation problem of the missing cell count values (see Methods). We repeated the previous experiments (k = 3 and k = 6), only this time we used the values of the estimated cell counts for a randomly selected 5% of the samples in each data set. In contrast to the previous experiments, in which each one of BayesCCE's components formed a linear transformation of one of the cell types, here the BayesCCE components form absolute estimates of the cell proportions (i.e. low absolute error). In addition, we observed up to 22% improvement in the mean correlation values compared with our previous experiments (Supplementary Figure S2 and Figure S3). We further tested this approach on simulated data (n = 650), while assuming known cell counts for 5% of the samples in the data, and found the mean correlation across different cell types and across ten different simulated data sets to be 0.78. Applying this approach with an increased number of samples for which cell counts are known reveals that the cell count estimates can be improved using a relatively small subset of a couple of dozen samples with known cell counts (Supplementary Figure S4).
In the absence of cell counts for a subset of the individuals in the data, external data with samples for which both methylation levels and cell counts are available can be added to the analysis. Again, we repeated the previous experiments (k = 3 and k = 6), only this time for each data set we added randomly selected 5% of the samples from the other data set, and used both their methylation levels and estimated cell counts in the analysis. Unlike in the previous experiments, here we potentially introduce new batch effects into the analysis, as in each experiment the original sample is combined with external data. We therefore accounted for the new batch information by adding it as a new covariate into BayesCCE. We observed up to 14% improvement in the mean correlation values compared with our previous experiments not taking any cell counts into account (Supplementary Figure S5 and Figure S6), showing that incorporating external samples with both methylation and cell counts can be a practical and useful way for estimating cell counts.

Discussion
We introduce BayesCCE, a Bayesian method that estimates cell type composition from heterogeneous methylation data using a prior on the cell composition distribution. In contrast to previous methods, using BayesCCE we can generate components such that each component corresponds to a linear transformation of a single cell type. These components can allow researchers to perform downstream analysis that is not possible using existing reference-free methods.
Our approach is based on finding a suitable linear transformation of the components found by ReFACTor [8]. Thus, it is limited by the quality of the ReFACTor components; in particular, BayesCCE will provide exactly the same result as ReFACTor if used to correct for a potential cell type composition confounder in methylation data. We therefore suggest using ReFACTor for correction and BayesCCE for cases in which a study of individual cell types is performed. We note that several supervised and unsupervised deconvolution methods have been suggested for estimating cell composition from gene expression [25][26][27][28][29]. However, these were refined for gene expression data and, to the best of our knowledge, none of these methods takes into account prior knowledge about the cell composition distribution as in BayesCCE. It remains of interest to investigate whether BayesCCE can be adapted for estimating cell composition from gene expression without the need for purified expression profiles.
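The equivalence with ReFACTor for correction can be seen from a short argument, sketched here under simplifying assumptions that are ours and not part of the formal model: the ReFACTor components are collected in a matrix $R$ with full column rank, the BayesCCE components are written as $C = RA$ for an invertible matrix $A$, and correction is performed by regressing the components out of each site $y$. The symbols $R$, $A$, $C$, $P$ and $y$ are introduced only for this illustration.

```latex
% Projection onto the column space of the BayesCCE components C = RA
\begin{align*}
P_C &= C\,(C^{\top}C)^{-1}C^{\top} \\
    &= RA\,(A^{\top}R^{\top}RA)^{-1}A^{\top}R^{\top} \\
    &= RA\,A^{-1}(R^{\top}R)^{-1}A^{-\top}A^{\top}R^{\top} \\
    &= R\,(R^{\top}R)^{-1}R^{\top} = P_R .
\end{align*}
% Hence the adjusted methylation levels coincide for every site y:
\[
(I - P_C)\,y = (I - P_R)\,y .
\]
```

Under these assumptions, both sets of components span the same subspace, so adjusting for either one removes the same variation from the data.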
The parameters of the prior required for BayesCCE can be estimated by utilizing previous studies that collected cell counts from the tissue of interest. Since no other genomic information is required, obtaining such data is relatively easy for many tissues, such as brain [30], heart [31] and adipose tissue [32]. In particular, such data should be substantially easier to obtain compared to reference data from sorted cells for the corresponding tissues. Ideally, one would want to use cell counts coming from the same population as the target population in the study, especially when the cell composition distribution of the studied tissue may vary substantially across different populations. While this may be a potential limitation of BayesCCE in cases where cell counts from the target population are not available, our results using priors estimated from three different data sets empirically show that priors estimated from a different population than the target population can still provide good estimates.
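As a rough illustration of how prior parameters might be derived from previously collected cell counts, the following sketch fits a Dirichlet distribution to a small table of cell proportions by moment matching. Several points here are assumptions made for illustration only: the Dirichlet form of the prior is not restated in this section, moment matching is a simple stand-in and not necessarily the estimation procedure described in Methods, the function name `fit_dirichlet_moments` is hypothetical, and the proportions are invented values.

```python
def fit_dirichlet_moments(proportions):
    """Moment-matching estimate of Dirichlet parameters.

    proportions: list of samples, each a list of k proportions summing to 1.
    Uses E[x_j] = a_j / a0 and Var[x_j] = m_j (1 - m_j) / (a0 + 1),
    where a0 is the sum of the Dirichlet parameters.
    """
    n = len(proportions)
    k = len(proportions[0])
    means = [sum(row[j] for row in proportions) / n for j in range(k)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in proportions) / (n - 1)
        for j in range(k)
    ]
    # Each cell type with non-zero variance yields one estimate of a0;
    # average them for a single concentration value.
    a0_estimates = [
        means[j] * (1.0 - means[j]) / variances[j] - 1.0
        for j in range(k)
        if variances[j] > 0
    ]
    if not a0_estimates:
        raise ValueError("proportions show no variation across samples")
    a0 = sum(a0_estimates) / len(a0_estimates)
    return [a0 * m for m in means]

# Hypothetical cell proportions (three cell types, four individuals).
cell_props = [
    [0.60, 0.30, 0.10],
    [0.55, 0.33, 0.12],
    [0.65, 0.27, 0.08],
    [0.60, 0.30, 0.10],
]

alpha = fit_dirichlet_moments(cell_props)
prior_mean = [a / sum(alpha) for a in alpha]
print("alpha:", [round(a, 2) for a in alpha])
print("prior mean:", [round(m, 3) for m in prior_mean])
```

The fitted parameters reproduce the empirical mean proportions by construction, while their sum reflects how tightly the proportions concentrate around that mean. With only a handful of individuals, as in this toy table, the concentration estimate would be unstable.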
Since no large data sets with measured cell counts are currently publicly available, we used a supervised method [5] for obtaining cell type proportion estimates, which were used as the ground truth in our experiments. Even though the method used for obtaining these estimates was shown to reasonably estimate leukocyte cell proportions from whole blood methylation data in several independent studies [22,16,20], these estimates may have introduced biases into the analysis. In particular, in the presence of systematic biases, the estimates could have affected the estimated priors, which in turn could have affected the results. However, we believe that our results on several independent data sets, including simulated data, and the use of priors estimated from several sources, including real cell counts, provide compelling evidence for the utility of BayesCCE.
Finally, we demonstrate that imputation of the cell counts can be highly accurate even when cell counts are available for only a relatively small number of individuals. Moreover, in the general setting of BayesCCE, each component is correlated with one cell type, and the identity of that cell type may not be known, whereas in the case of imputation BayesCCE is able to reconstruct the cell counts up to a small absolute error (i.e., each component corresponds to a known cell type and is scaled to form cell proportion estimates of that cell type). We therefore recommend that in future studies either the cell counts be measured for at least a couple of dozen of the samples or external data of samples with measured cell counts be utilized in the analysis.