 Method
 Open Access
 Published:
Melissa: Bayesian clustering and imputation of singlecell methylomes
Genome Biology volume 20, Article number: 61 (2019)
Abstract
Measurements of singlecell methylation are revolutionizing our understanding of epigenetic control of gene expression, yet the intrinsic data sparsity limits the scope for quantitative analysis of such data. Here, we introduce Melissa (MEthyLation Inference for Single cell Analysis), a Bayesian hierarchical method to cluster cells based on local methylation patterns, discovering patterns of epigenetic variability between cells. The clustering also acts as an effective regularization for data imputation on unassayed CpG sites, enabling transfer of information between individual cells. We show both on simulated and real data sets that Melissa provides accurate and biologically meaningful clusterings and stateoftheart imputation performance.
Background
DNA methylation is probably the best studied epigenomic mark, due to its wellestablished heritability and widespread association with diseases and a broad range of biological processes, including Xchromosome inactivation, cell differentiation, and cancer progression [1–3]. Yet its role in gene regulation, and the molecular mechanisms underpinning its association with diseases, is still imperfectly understood.
Bisulfite treatment of DNA followed by sequencing (BSseq) has provided a powerful tool for measuring the methylation level of cytosines on a genomewide scale with single nucleotide resolution [4]. BSseq protocols have been vastly improved over the last decade, with BSseq rapidly becoming a widespread tool in biomedical investigation. Nevertheless, until very recently, BSseq could only be used to measure methylation in bulk populations of cells [5], preventing effective investigations of the role of DNA methylation in shaping transcriptional variability and early development [6, 7].
This shortcoming has been addressed within the last 5 years through the development of protocols to measure DNA methylation at singlecell resolution using either scBSseq [8] or scRRBS [9] making it possible to uncover the heterogeneity and dynamics of DNA methylation [10]. Even more recently, methods have been developed that can sequence both the methylome and the transcriptome or other features in parallel, potentially enabling a quantification of the role of DNA methylation in explaining transcriptional heterogeneity [11–13]. However, due to the small amounts of genomic DNA per cell, these protocols usually result in very sparse genomewide CpG coverage (i.e., for most CpGs, we have missing values), ranging from 5% in highthroughput studies [14, 15] to 20% in lowthroughput ones [8, 11]. The sparsity of the data represents a major hurdle to effectively use singlecell methylation assays to inform our understanding of epigenetic control of transcriptomic variability, or to distinguish individual cells based on their epigenomic state.
In this paper, we address these problems by using a twopronged strategy. First, we note that several recent studies have highlighted the importance of local methylation profiles, as opposed to individual CpG methylation, in determining the epigenetic state of a region [16–18]. This implies that local spatial correlations may be effectively leveraged to ameliorate the issue of data sparsity. Secondly, singlecell BSseq protocols, as all singlecell highthroughput protocols, simultaneously assay a large number of cells, ranging from several tens [8] to a few thousands in the most recent studies [14]. Such abundance of data could be exploited to our advantage to transfer information across similar cells.
We implement both of these strategies within Melissa (MEthyLation Inference for Single cell Analysis), a Bayesian hierarchical model that jointly learns the methylation profiles of genomic regions of interest and clusters cells based on their genomewide methylation patterns. In this way, Melissa can effectively use both the information of neighboring CpGs and of other cells with similar methylation patterns in order to predict CpG methylation states. As an additional benefit, Melissa also provides a Bayesian clustering approach capable of identifying subsets of cells based solely on epigenetic state, to our knowledge the first clustering method tailored specifically to this rapidly expanding technology. We benchmark Melissa on both simulated and real singlecell BSseq data, demonstrating that Melissa provides both stateofthe art imputation performance and accurate clustering of cells. Furthermore, thanks to a fast variational Bayes estimation strategy, Melissa has good scalability and can provide an effective modeling tool for the increasingly large singlecell methylation studies which will become prevalent in coming years.
Results and discussion
Melissa addresses the data sparsity issue by leveraging local correlations between neighboring CpGs and similarity between individual cells (see Fig. 1). The starting point is the definition of a set of genomic regions (e.g. genes or enhancers) over which the model will be applied. Within each region, Melissa postulates a latent profile of methylation, a function mapping each CpG within the region to a number in [0,1] which defines the probability of that CpG being methylated. To ensure spatial smoothness of the profile, Melissa uses a generalized linear model (GLM) of basis function regression along the lines of [16] (with modified likelihood to account for single cell data). Local correlations are however often insufficient for regions with extremely sparse coverage, and these are quite common in scBSseq data. Therefore, we share information across different cells by coupling the local GLM regressions through a shared prior distribution. In order to respect the (generally unknown) population structure that may be present within the cells assayed, we choose a (finite) Dirichlet mixture model prior. The output of Melissa is therefore twofold: at each genomic region in each cell, we get a predicted profile of methylation, which can be used to impute missing data (i.e., unassayed CpGs). For each cell, we also get a discrete cluster membership probability, providing a methylomebased clustering of cells. This twofold output of Melissa reflects its methodological foundations as a hybrid between a global unsupervised model (Bayesian clustering of methylomes) and a local supervised learning model (GLM regression for every region). In this sense, Melissa is closer to a mixture of experts model [19, Chapter 14, Section 5] than a standard mixture model.
Benchmarking Melissa on simulated data
We benchmark the ability of our model to cluster and impute CpG methylation states at the singlecell level both on simulated and mouse embryonic stem cell (ESC) data sets. To assess test prediction performance, we consider different metrics, including Fmeasure, the area under the receiver operating characteristic curve (AUC), and precision recall curves [20]. We explore the performance of a number of methods as we vary three possible experimental parameters: the number of cells assayed, the cluster dissimilarity (how different the methylomes of cells in different clusters are expected to be), and the CpG coverage (defined as the fraction of CpG sites covered by at least one read, averaged over all cells).
To benchmark the performance of Melissa in predicting CpG methylation states, we compare it against six different imputation strategies. As a baseline approach, we compute the average methylation rate separately for each cell and region (Rate), that is, the average is taken over all CpG sites forming a genomic region. We also use the BPRMeth model [16, 21], where we account for the binary nature of the observations, which we train independently across cells and regions (BPRMeth). Note that BPRMeth shares information across CpG sites inside each genomic region; however, it does not transfer information across cells. To share information across cells, but not across neighboring CpGs inside the region, we constrain Melissa to infer constant functions, i.e., learn average methylation rate (Melissa rate). We also use a Gaussian mixture model (GMM) that takes as input average M values [22] instead of average methylation rates across the region (see the “Methods” section); to avoid possible problems due to highdimensionality, the GMM method was also tested on reduceddimensionality data, where the first ten principal components were retained. Additionally, as a fully independent baseline, we use a Random Forest classifier trained on individual cells and regions, where the input features are the observed CpG locations, and the response variable is the CpG methylation state: methylated or unmethylated (RF). This is essentially the method of [23], however, without using additional annotation data or DNA sequence patterns. We delay comparisons with the deep learning method DeepCpG [24] to the next section, as DeepCpG is not applicable in the settings of this simulation (see below).
In order to generate realistic simulated singlecell DNA methylation data, we extracted methylation profiles from real (bulk) BSseq data using the BPRMeth package [21], and then generated binary methylation levels at a random subset of CpGs to simulate the low coverage of scBSseq. In total, we simulated N=200 cells from K=4 subpopulations, where each cell consisted of M=100 genomic regions. Additionally, to account for different levels of similarity between cell subpopulations, we simulated 11 different data sets by varying the proportion of similar genomic regions between clusters. Finally, to assess the performance of Melissa as a function of assayed single cells, we simulated 10 different data sets by varying N, the total number of single cells (see the “Methods” section).
Applying the competing methods to synthetic data, we observe that Melissa yields a substantial improvement in prediction accuracy compared to all other models (Fig. 2, Additional file 1: Figure S1 and S2). Notably, Melissa is robust across different settings of the data, such as CpG coverage proportion (Fig. 2a) or the total number of cells assayed in each experiment (Fig. 2b). Due to its ability to transfer information across cells and neighboring CpGs, our model robustly maintains its prediction accuracy at a very sparse coverage level of 10% or even when assaying around 25 single cells. The BPRMeth and RF models perform poorly at low CpG coverage settings, becoming comparable to Melissa when using the majority of the CpGs for training set. Importantly, Melissa still performs better at 90% CpG coverage, demonstrating that the clustering acts as an effective regularization for imputing unassayed CpG sites. As expected, Melissa Rate and GMM have very similar performance (due to the very similar model structure); for both methods, performance is significantly weaker than Melissa across the full range of simulation settings, since they are not expressive enough to capture spatial correlations between CpGs. Using GMM on reduced dimensionality data did not lead to an improvement in performance, either for imputation or clustering (data not shown). Finally, the naive Rate method has the worst imputation performance of all methods, by a considerable margin. The imputation performance of all methods is relatively insensitive to the degree of cluster dissimilarity (Additional file 1: Figure S2).
Next, we consider the clustering performance of Melissa. Since most of the rival methods do not have a notion of clustering, we compare Melissa to clustering using methylation rates for binary data (Melissa Rate) or Gaussian data (GMM) using M values [22]. As a performance metric, we use the Adjusted Rand Index (ARI) [25] between the true cluster assignment and the predicted cluster membership returned from the model. Figure 3a shows ARI values comparing the three models for varying CpG coverage (with cluster dissimilarity level at 0.5 and N = 200 cells). Melissa performs perfectly in all settings, demonstrating its power and sensitivity in identifying robustly the cell subpopulation structure. When varying the level of cluster dissimilarity (see Fig. 3b), the model is still able to retain its high clustering performance. As expected, for settings with low variability between clusters (i.e., cell subpopulations are difficult to distinguish), the performance drops; however, Melissa is consistently superior to the Melissa Rate and GMM models and rapidly reaches nearperfect clustering accuracy. Similarly, when varying the total number of cells assayed in each experiment (see Fig. 3c), Melissa retains its almost perfect clustering performance and is still consistently superior than the competing models.
Subsequently, we test Melissa’s ability to perform model selection, that is, to identify the appropriate number of cell subpopulations. To do so, we run the model on simulated data, setting the initial number of clusters to K = 10 and letting the variational optimization prune away inactive clusters [26]. We used both broad (red line) and shrinkage (blue line) priors. Figure 3d shows that the variational optimization automatically recovered the correct number of mixture components for almost all parameter settings. As expected, in settings with high between cluster similarity, the model with shrinkage prior returned fewer clusters, since the data complexity term in Eq. (9) (see the “Methods” section) was penalizing more the variational approximation compared to the gain in likelihood from explaining the data. Finally, we assess the scalability of Melissa with respect to the number of single cells. Additional file 1: Figure S3 compares the variational Bayes (red line) with the Gibbs sampling (blue line) algorithm, which demonstrates the good scalability of variational inference where we can analyze thousands of single cells in acceptable running times. The maximum number of iterations for the variational Bayes algorithm was set to 400, and the Gibbs algorithm was run for 3000 iterations. Both algorithms are implemented in the R programming language and were run on a machine utilizing at most 16 CPU cores.
Benchmarking Melissa on subsampled bulk ENCODE data
The results in the previous section convincingly showed a substantial advantage of Melissa over competing methods both in terms of imputation performance and in terms of clustering. However, conditioned on some seed profiles learnt from bulk data, the simulation was conducted on data which was directly sampled from the generative Melissa model (with some additional noise), which could conceivably introduce an unfair bias in the comparison. Additionally, since data were simulated as separate regions, comparison with the deep learning method DeepCpG [24] was not possible, since DeepCpG requires the information of a large number of neighboring CpGs to predict the methylation state of each target CpG site. To faithfully simulate scBSseq data, we generated two additional synthetic data sets by directly subsampling bulk ENCODE reduced representation bisulfite sequencing (RRBS) and wholegenome bisulfite sequencing (WGBS) experiments (see the “Methods” section). For the bulk RRBS data, we randomly subsampled 10% of the mapped reads and generated 40 pseudosingle cells from the GM12878 and H1hESC cell lines. Due to the higher sequencing depth of bulk WGBS experiments, only 0.5% of the mapped reads were subsampled to generate pseudosinglecell methylomes. Subsequently, reads falling in the same genomic site were binarised to obtain a digital output of methylation. Finally, the two cell lines were combined in a single data set of 80 pseudosingle cells prior to running Melissa. This procedure produces data with a more similar structure to real scBSseq data, since the uneven read coverage better captures the structure of missing data observed in single cell epigenomic experiments.
Table 1 shows the results for the two studies when imputing CpGs falling in genomic regions of ± 2.5 kb around transcription start sites (TSS) for different levels of CpG coverage. Consistently with the simulation study in the previous section, Melissa performs significantly better (on scRRBS synethtic data) or comparable (on scWGBS synthetic data) to competitors at imputation tasks. As reported in [24], DeepCpG performs very strongly with comparable accuracy to Melissa across all CpG coverage settings (notice that training of DeepCpG is however slightly different, see “Methods” section). The systematically lower performance of DeepCpG on the scRRBS data set is to be expected as DeepCpG relies on information from neighboring CpGs over a large region, and might therefore be at disadvantage for data generated using this technology. The results are consistent across all different metrics considered in this paper and when increasing the window size to ± 5 kb around TSS (see Additional file 1: Figure S4–S9). Finally, Melissa could easily separate both cell subpopulations for all settings considered in this study.
Melissa accurately predicts methylation states on real data
To assess Melissa’s performance on real scBSseq data, we considered two mouse ESC data sets generated from scM&Tseq [11] and scBSseq [8] protocols. The mouse ESCs were cultured either in 2i medium (2i ESCs) or serum conditions (serum ESCs); hence, we expect methylation heterogeneity between cell subpopulations. In addition, in serum ESCs, there is evidence of additional CpG methylation heterogeneity [27], making these data suitable for the model selection task to infer cell subpopulation structure. The analysis on both data sets was performed on six different genomic contexts: protein coding promoters with varying genomic windows: ± 1.5 kb, ± 2.5 kb, and ± 5 kb around TSS, active enhancers, super enhancers, and Nanog regulatory regions (see the “Methods” section for details on data preprocessing). It should be noted that DeepCpG is designed to predict individual missing CpGs, rather than missing regions, and requires always information about neighboring CpGs. This means that, during prediction, DeepCpG always has access to more data than competing methods, potentially providing it with an unfair advantage; to partly address this problem, we also present results when DeepCpG had access to subsampled data (labeled DeepCpG Sub in our figures). In general, DeepCpG should be thought as complementary to Melissa, and comparisons should be evaluated cautiously (see below).
We first applied Melissa on the scM&Tseq data set which consists of 75 single cells (14 2i ESCs and 61 serum ESCs). Figure 4a shows a direct comparison of the imputation performance of all the methods across a variety of genomic contexts. Melissa is better or comparable to rival methods in terms of AUC (see Fig. 4a) and substantially more accurate in terms of Fmeasure (Additional file 1: Figure S10), demonstrating its ability to capture local CpG methylation patterns. DeepCpG also performs strongly on most genomic regions, indicating that a flexible deep learning method is effective in capturing patterns of methylation. Similar results were obtained by considering different metrics (Additional file 1: Figure S10–S12). Boxplots show performance distributions across 10 independent training/test splits of the data, except for DeepCpG, where the high computational costs prevented such investigation. Interestingly, methods based on methylation rates performed poorly at promoters, underlining the importance of methylation profiles in distinguishing epigenetic state near transcription start sites and identifying meaningful cell subpopulations. For all models, the imputation performance (in terms of AUC) at active enhancers was lower, indicating high methylation variability across cells and neighboring CpG sites as shown in [8].
In terms of clustering performance, Melissa confirms that the data supports the existence of a subpopulation of serum cells as suggested previously [27], by returning three clusters in almost all contexts. Further insights on the biological significance of the clusters obtained can be gleaned by inspecting the inferred methylation profiles at relevant regions. Figure 4b shows posterior methylation profiles for three developmental genes for each cell subpopulation (Additional file 1: Figure S13 shows additional methylation profiles of developmental genes). Each color corresponds to a different cell subpopulation, with orange profiles corresponding to 2i ESCs which are globally hypomethylated. The green and purple profiles correspond to serum cells, which, as expected, present an increased level of methylation overall. However, Melissa identifies a clear subpopulation structure within these serum cells: the purple cluster clearly represents a subpopulation of cells which has only incompletely transitioned towards the final differentiated state (high global methylation punctuated by hypomethylated CpG islands). Interestingly, 2i cells can be easily separated from serum cells based on methylation rate alone, due to the global hypomethylation of 2i cells; however, the subpopulation structure within serum cells appears to be determined by changes in profiles.
As a second real data set, we analyzed the smaller scBSseq data set which consists of only 32 cells (12 2i ESCs and 20 serum ESCs). The imputation performance in terms of AUC across genomic contexts is shown in Fig. 5. Melissa retains its high prediction accuracy and is comparable with DeepCpG across most contexts (see Additional file 1: Figure S14–S16 for performance on different metrics), even though the full DeepCpG model has slightly better performance on this data set. This suggests that the small number of cells in this data set did not allow an effective sharing of information. In terms of clustering performance, Melissa identifies three clusters in the vast majority of settings, once again underlying the emergence of epigenomically distinct populations within serum cells (see Additional file 1: Figure S17 and S18 for example methylation profiles across genomic contexts).
A note on the comparison with DeepCpG
Melissa and DeepCpG models reported substantially better imputation performance compared to the rival methods and show comparable performance when analyzed on real data sets, demonstrating their flexibility in capturing complex patterns of methylation. However, the two methods have significantly different computational performances. In our experiments, Melissa’s runtime was less than 6 h for all genomic contexts running on a small server machine utilizing at most ten CPU cores (see Additional file 1: Table S1 and S2). By contrast, DeepCpG required around 3 to 4 days to analyze each data set on a GPU cluster equipped with high end NVIDIA Tesla K40ms GPUs, and had very high memory requirements. These computational overheads effectively make DeepCpG out of reach for smaller research groups. On the other hand, Melissa operates on a set of genomic contexts of interest (e.g., promoters), while DeepCpG is designed for genomewide imputation; computational performance of both methods will therefore depend on specific choices, such as the size/number of the regions of interest for Melissa, or the number of training chromosomes for DeepCpG.
In addition to the differences in scope between the two methods, one should also be cautious when directly comparing prediction performances due to the different design of the DeepCpG model. DeepCpG is trained on a specific set of chromosomes and considers each CpG site independently; hence, it does not have a notion of genomic region to be trained on and will in any case utilize information from neighboring CpGs within or outside the region, information that Melissa and the rival methods do not have access to.
Conclusions
Singlecell DNA methylation measurements are rapidly becoming a major tool to understand epigenetic gene regulation in individual cells. Newer platforms are rapidly expanding the scope of the technology in terms of assaying large numbers of cells [14]; however, all technologies are plagued by intrinsically low coverage in terms of numbers of CpGs assayed.
In this paper, we have proposed Melissa as a way of addressing the low coverage issue by sharing information between CpGs with a local smoothing and between cells with a Bayesian clustering prior. On both synthetic and real data, Melissa achieved stateofthe art imputation performance over a panel of competing methods, including DeepCpG [24] and random forests. While achieving comparable or superior performance to blackbox methods, such as neural networks and random forests, Melissa is more transparent and needs minimal tuning: all the results shown, on both synthetic and real data, were obtained with the same settings of the algorithm. Additionally, as all Bayesian methods, Melissa outputs are probability distributions that fully quantify the uncertainty on the model’s prediction, and which are more easily usable for further experimental design compared to the pointestimates provided by blackbox approaches. Melissa does not require additional annotation data as in [23] or [28] and does not exploit sequence information like DeepCpG, but an extension leveraging side data would be easily accomplished within the Bayesian framework and would represent an interesting extension for future research. By using a Bayesian clustering prior, Melissa has the added benefit of simultaneously uncovering the population structure within the assay, as we demonstrated in the real data examples; Melissa can therefore be a useful tool in uncovering epigenetic diversity among cells.
In addition, in this work, Melissa was applied on predefined genomic regions of interest, such as promoters and enhancers; however, one could easily perform genomewide imputation and clustering of singlecell methylomes by using a sliding (nonoverlapping) window approach. While this paper was under review, we became aware of a new preprint describing Epiclomal [29], a method to perform clustering of singlecell DNA methylomes using a Bayesian probabilistic model. Epiclomal shares a similar hierarchical structure to Melissa and also models bisulfite conversion error; however, Epiclomal does not model the spatial variability of neighboring CpGs and therefore cannot perform imputation as Melissa does.
While Melissa accounts for heterogeneity in the cell population structure, it does not allow for heterogeneity at the singlegene level: each cluster has a single methylation profile within each region, and all variability at the single locus level is attributed to noise. This rigidity limits the usefulness of Melissa as a tool to investigate intrinsic stochasticity in methylation at the single locus level. Relaxing the modeling assumptions to accommodate methylation variability in Melissa is an interesting topic for future research. Another area where Melissa could be fruitfully applied is the integrative study of multiple highthroughput features in single cells. Recently, Kapourani and Sanguinetti [16] showed that features extracted from methylation profiles could be effectively used to predict gene expression in bulk experiments. With the advent of novel technologies measuring gene expression and multiple epigenomic features in individual cells [13], interpretable Bayesian models like Melissa are likely to play an important role in furthering our understanding of epigenetic control of gene expression in single cells.
Methods
Melissa model
In order to provide spatial smoothing of the methylation profiles at specific regions, we adapt a generalized linear model of basis function regression proposed recently [16] and further extended and implemented in the BPRMeth Bioconductor package [21]. The basic idea of BPRMeth is as follows: the methylation profile associated with a genomic region m is defined as a (latent) function f:m→(0,1) which takes as input the genomic coordinate along the region and returns the propensity for that locus to be methylated. For singlecell methylation data, methylation of individual CpG sites can be naturally modeled using a Bernoulli observation model, since for the majority of covered sites we have binary CpG methylation states (see Additional file 1: Figure S13). More specifically, for a specific region m, we model the observed methylation of CpG site i as:
where the unknown “true” methylation level ρ_{mi} has as covariates the CpG locations x_{mi}. Then, we define the BPRMeth model as:
where w_{m} are the regression coefficients, x_{mi}≡h(x_{mi}) are the basis function transformed CpG locations (here we consider radial basis functions (RBFs)), and g(·) is the link function that allows us to move from the systematic components η_{mi} to mean parameters ρ_{mi}. Here we consider a probit regression model which is obtained by defining g^{−1}(·) = Φ(·) — where Φ(·) denotes the cdf of the standard normal distribution—ensuring that f takes values in the [0,1] interval. Notice that both BPRMeth and Melissa do not explicitly model bisulfite conversion errors. Conversion errors are estimated to be relatively rare and below 1% [30], and we show in our simulation studies that Melissa is highly robust to the addition of noise mimicking possible errors.
To account for the limited CpG coverage of scBSseq experiments, the BPRMeth model was recently reformulated in a Bayesian framework [21]. The model was made amenable to Bayesian estimation thanks to a data augmentation strategy [31]. This strategy consists of introducing an additional auxiliary latent variable z_{i}, which has a Gaussian distribution conditioned on the input w^{⊤}x_{i}, leading to the graphical model in Fig. 6.
The BPRMeth model is limited to sharing informati.on across CpGs via local smoothing (which certainly helps in dealing with data sparsity); however, in our experience the coverage in scBSseq data is insufficient to infer informative methylation profiles at many genomic regions. We therefore propose Melissa to exploit the population structure of the experimental design and additionally share and transfer information across cells.
Assume that we have N(n=1,...,N) cells and each cell consists of M(m=1,...,M) genomic regions, for example promoters, and we are interested in both partitioning the cells in K clusters and inferring the methylation profiles for each genomic region. To do so, we use a finite Dirichlet mixture model (FDMM) [32], where we assume that the methylation profile of the m^{th} region for each cell n is drawn from a mixture distribution with K components (where K<N). This way, cells belonging to the same cluster will share the same methylation profile, although profiles will still differ across genomic regions. Let c_{n} be a latent variable comprising a 1ofK binary vector with elements c_{nk} representing the component that is responsible for cell n, and π_{k} be the probability that a cell belongs to cluster k, i.e. π_{k}=p(c_{nk}=1). The conditional distribution of C={c_{1},…,c_{N}} given π is:
Considering the FDMM as a generative model, the latent variables c_{n} will generate the latent observations \(\mathbf {Z}_{n} \in \mathbb {R}^{M\times I_{m}}\), which in turn will generate the binary observations \(\mathbf {Y}_{n} \in \{0, 1\}^{M\times I_{m}}\) depending on the sign of Z_{n}, as shown in Fig. 6. The conditional distribution of the data (Z,Y), given the latent variables C and the component parameters W, becomes:
where
To complete the model, we introduce priors over the parameters. We choose a Dirichlet distribution over the mixing proportions, \(p(\boldsymbol {\pi }) = \mathcal {D}\text {ir}(\boldsymbol {\pi } \thinspace  \thinspace \boldsymbol {\delta }_{0})\), where for symmetry we choose the same parameter \(\delta _{0_{k}}\) for each of the mixture components. We also introduce an independent Gaussian prior over the coefficients W, that is:
Finally, we introduce a prior distribution for the (hyper)parameter τ and assume that each cluster has its own precision parameter, \(p(\tau _{k}) = \mathcal {G}\text {amma}(\tau _{k} \thinspace  \thinspace \alpha _{0}, \beta _{0})\). Having defined our model, we can now write the joint distribution over the observed and latent variables:
where the factorization corresponds to the probabilistic graphical model shown in Fig. 7, resulting in the following hierarchical model:
Importantly, Melissa is a hybrid between a global unsupervised clustering model and a local supervised prediction model, encoded through the GLM regression coefficients w for each genomic region. When considering Melissa as an imputation (or predictive) model, the training data are coming by using only a subset of CpG tuples (x_{nmi},y_{nmi}) for each region. For example, from the observed I_{nm} CpGs in a given region, Melissa will only see I_{nm}/2 random CpGs during training, and the remaining CpGs will be used as a held out test set to evaluate its prediction performance. Note that in any case, either using all CpGs or a subset during training, Melissa will additionally perform clustering at the global level which is encoded through the latent variables c_{n}.
Variational inference
The posterior distribution of the latent variables given the observed data p(Z,C,W,π,τ  Y,X) for the Melissa model is not analytically tractable; hence, we resort to approximate techniques. The most common method for approximate Bayesian inference is to perform Markov Chain Monte Carlo (MCMC) [33]; however, sampling methods require considerable computational resources and do not scale well when performing genomewide analysis on hundreds or thousands of single cells. Variational methods can provide an efficient, approximate solution with better scalability in this case (see the “Results” section for a comparison between Gibbs sampling and variational inference for this model). More specifically, we use meanfield variational inference [34] which assumes that the approximating distribution factorizes over the latent variables:
Detailed mathematical derivations of the optimal variational factors are available in Additional file 1: Section 1. Next, we iteratively update each factor q while holding the remaining factors fixed using the coordinate ascent variational inference (CAVI) algorithm which is summarized in Algorithm 1.
Predictive density and model selection
Given an approximate posterior distribution, we are in the position to predict the methylation level at unobserved CpG sites. The predictive density of a new observation y_{∗}, which is associated with latent variables c_{∗}, z_{∗} and covariates X_{∗}, is given by:
where we collectively denote as θ the relevant parameters being marginalized.
It has been repeatedly observed [26] that, when fitting variationally a mixture model with a large number of components, the variational procedure will prune away components with no support in the data, hence effectively determining an appropriate number of clusters in an automatic fashion, i.e., perform model selection. We can gain some intuition as to why this happens in the following way. We can rewrite the KullbackLeibler (\(\mathcal {KL}\)) divergence as:
where lnp(X) can be ignored since it is constant with respect to q(θ). To minimize this objective function, the variational approximation will both try to increase the expected log likelihood of the data lnp(X  θ) while minimizing its \(\mathcal {KL}\) divergence with the prior distribution p (θ). Hence, using variational Bayes, we have an automatic tradeoff between fitting the data and model complexity [19].
Assessing Melissa via a simulation study
To generate realistic simulated singlecell methylation data, we first used the BPRMeth package [21] to infer five prototypical methylation profiles from the GM12878 lymphoblastoid cell line. The bulk BSseq data for the GM12878 cell line are publicly available from the ENCODE project [35]. Based on these profiles, we simulated singlecell methylation data (i.e., binary CpG methylation states) for M = 100 genomic regions, where each CpG was generated by sampling from a Bernoulli distribution with probability of success given by the latent function evaluation at the specific site. To mimic the inherent noise introduced by bisulfite conversion error, Gaussian noise \(\mathcal {N}(\mu = 0, \sigma = 0.05)\) was introduced to the probability of success prior to generating each binary CpG site. This process can be thought of as generating methylation data for a specific single cell. Next, we generated K = 4 cell subpopulations by randomly shuffling the genomic regions across clusters, so now each cell subpopulation has its own methylome landscape. In total, we generated N = 200 cells, with the following cell subpopulation proportions: 40%, 25%, 20%, and 15%. Additionally, to account for different levels of similarity between cell subpopulations, we simulated 11 different data sets by varying the proportion of similar genomic regions between clusters. Finally, to assess the performance of Melissa for varying number of cells assayed, we simulated 10 different data sets by varying the total number of single cells N. The scripts (written in the R programming language) for this simulation study are publicly available on the Melissa repository.
Assessing Melissa on subsampled bulk ENCODE data
To faithfully simulate methylation data that resemble scBSseq experiments, we generated two additional synthetic data sets by subsampling bulk ENCODE RRBS (GEO: GSE27584) and WGBS (GEO: GSE80911 for H1hESC and GSE86765 for GM12878) data, each consisting of two different cell lines, H1hESC and GM12878. The RRBS data are enriched for genomic regions with high CpG content (using methylation sensitive restriction enzymes such as MspI that recognizes CCGG motifs) which predominantly reside near promoter regions and CpG islands. On the other hand, WGBS experiments in theory can assay the whole methylome landscape of the human genome; however, they require highsequencing depth to obtain an accurate estimate of the bulk methylation level at each CpG site. To retain the structure of missing data observed in scBSseq experiments (due to read length), we directly subsampled the raw FASTQ files which essentially lead to discarding individual reads rather than individual CpGs. For the RRBS data set, from each cell line, we generated 40 pseudosingle cells by randomly keeping 10% of the mapped reads from the bulk experiment, resulting in 80 cells when combining both cell lines. For the WGBS data set, the same number of pseudosingle cells was generated from each cell line, with the only difference that only 0.5% of the mapped reads were retained from the bulk data due to the highsequencing depth of the experiments. This process was performed for chromosomes 1 to 6 to alleviate the computational burden. Subsequently, the same preprocessing steps detailed in the previous section were performed, with the only difference that for this study we considered only ± 2.5 kb and ± 5 kb promoter regions around TSS. Each model, except DeepCpG, used 20%, 50%, and 80% of the CpGs as training set, and the remaining of CpGs were used as a test set to evaluate imputation performance. The DeepCpG model used chromosomes 1 and 3 as training set, chromosome 5 as validation set, and the remaining chromosomes as test set.
scBSseq data and preprocessing
Singlecell bisulfite sequencing protocols provide us with single basepair resolution of CpG methylation states. Since we assay the DNA of a single cell, the methylation level for each CpG site is predominantly binary, either methylated or unmethylated. However, due to each chromosome having two copies, a small proportion of CpG sites have a nonbinary nature (see Additional file 1: Figure S19). To avoid ambiguities, hemimethylated sites—sites with 50% methylation level—are filtered prior to downstream analysis, and for the remaining sites, binary methylation states are obtained from the ratio of methylated read counts to total read counts [11].
Two mouse embryonic stem cells (ESCs) data sets were used to validate the performance of the Melissa model. The scM&Tseq data set [11] after quality assessment consisted of 75 single cells out of which 14 cells were cultured in 2i medium (2i ESCs) and the remaining 61 cells were cultured in serum conditions (serum ESCs). The Bismark [36] processed data, with reads mapped to the GRCm38 mouse genome, were downloaded from the Gene Expression Omnibus under accession GSE74535. The scBSseq data set [8] contained 32 cells out of which 12 cells were 2i ESCs and the remaining 20 cells were serum ESCs, and the Bismark processed data, with reads mapped to the GRCm38 mouse genome, are publicly available under accession number GSE56879. For both data sets, the observed data that are used as input to Melissa are binary methylation states: unmethylated CpGs are encoded with zero and methylated CpGs with one. We should note that this is the standard procedure for processing scBSseq data [8] and additional information and visualizations regarding the quality of the scBSseq data can be found in the original publications.
Since Melissa considers genomic regions for a specific genomic context, we use the BPRMeth package [21] to filter CpGs that do not fall inside these regions, and create a simple data structure where each cell is a encoded as a list, and each entry of the list—corresponding to a specific genomic region—is a matrix with two columns: the (relative) CpG location and the methylation state. We considered six different genomic contexts where we applied Melissa: protein coding promoters with varying genomic windows: ± 1.5 kb, ± 2.5 kb, and ± 5 kb around transcription start sites (TSS), active enhancers, super enhancers, and Nanog regulatory regions. Due to the sparse CpG coverage, for the three genomic contexts except promoters, we filtered loci with smaller than 1 kb annotation length, and specifically for Nanog regions, we took a window of ± 2.5 kb around the center of the genomic annotation. In addition, we only considered regions that were covered in at least 50% of the cells with a minimum coverage of 10 CpGs and had between cell variability; the rationale being that homogeneous regions across cells do not provide additional information for identifying cell subpopulations. The CpG coverage distribution after the filtering process across different genomic contexts is shown in Additional file 1: Figure S20 and S21. The sparsity level of the two scBSseq data sets across different genomic contexts is shown in Additional file 1: Table S3. It should be noted that imputation performance is evaluated only on genomic regions that pass the filtering threshold. We run the model with K=6 and K=5 clusters for the scM&Tseq and scBSseq data sets, respectively, and we use a broad prior over the model parameters.
Performance evaluation
To assess model performance across all genomic contexts, we partition the data and use 50% of the CpGs in each cell and region for training set and the remaining 50% as test set (except DeepCpG, see below). The prediction performance of all competing models, except DeepCpG, was evaluated on imputing all missing CpG states in a given region at once. For computing binary evaluation metrics, such as Fmeasure, predicted probabilities above 0.5 were set to one and rounded to zero otherwise.
Fmeasure The Fmeasure or F_{1}score is the harmonic mean of precision and recall:
Gaussian mixture model The input to the Gaussian mixture model (GMM) is the average methylation rate across the region; since rates are between (0,1), we transform them to M values, which follow closer the Gaussian distribution [22]. The transformation from average methylation rates to average M values is obtained by:
Adjusted Rand Index The Adjusted Rand Index (ARI) is a measure of the similarity between two data clusterings:
DeepCpG
The DeepCpG method takes a different imputation approach: it is trained on a specific set of chromosomes and predicts methylation states on the remaining chromosomes where it imputes each CpG site sequentially by using as input a set of neighboring CpG sites. This approach makes it difficult to equally compare with the rival methods, since for each CpG the input features to DeepCpG are all the neighboring sites, whereas the competing models have access to a subset of the data and they make predictions in one pass for the whole region. Since we only had access to CpG methylation data and to make it comparable with the considered methods, we trained the CpG module of DeepCpG (termed DeepCpG CpG in [24]).
For the scM&Tseq data set, chromosomes 3 and 17 were used as training set, chromosomes 12 and 14 as validation set and the remaining chromosomes as test set. For the scBSseq data set, chromosomes 3, 17, and 19 were used as training set; chromosomes 12 and 14 as validation set; and the remaining chromosomes as test set. The chosen chromosomes had at least three million CpGs used as training set, a sensible size for the DeepCpG model as suggested by the authors. A neighborhood of K = 20 CpG sites to the left and the right for each target CpG was used as input to the model. During testing time, even if a given genomic region did not contain at least 40 CpGs, the DeepCpG model used additional CpGs outside this window to predict methylation states, hence using more information compared to the rival models. In total, the DeepCpG model took around 4 days per data set for training and prediction on a cluster equipped with NVIDIA Tesla K40ms GPUs.
Abbreviations
 ARI:

Adjusted Rand Index
 AUC:

Area under the receiver operating characteristic curve
 BPRMeth:

Bayesian probit regression for methylation
 bp:

Base pair
 CAVI:

Coordinate ascent variational inference
 CPU:

Central processing unit
 ESC:

Embryonic stem cell
 GEO:

Gene expression omnibus
 GLM:

Generalized linear model
 GMM:

Gaussian mixture model
 GPU:

Graphics processing unit
 MCMC:

Markov chain Monte Carlo
 RBF:

Radial basis function
 RF:

Random Forest
 scBSseq:

Singlecell bisulfite sequencing
 scRRBS:

Singlecell reduced representation bisulfite sequencing
 TSS:

Transcription start site
 WGBS:

Whole genome bisulfite sequencing
References
 1
Bird A. DNA methylation patterns and epigenetic memory. Genes Dev. 2002; 16(1):6–21. Available from: http://www.ncbi.nlm.nih.gov/pubmed/11782440.
 2
Baylin SB, Jones Pa. A decade of exploring the cancer epigenome  biological and translational implications. Nat Rev Cancer. 2011; 11(10):726–34. Available from: https://doi.org/10.1038/nrc3130.
 3
Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet. 2012; 13(7):484–92. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22641018.
 4
Krueger F, Kreck B, Franke A, Andrews SR. DNA methylome analysis using short bisulfite sequencing data. Nat Methods. 2012; 9(2):145–51. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22290186.
 5
Shapiro E, Biezuner T, Linnarsson S. Singlecell sequencingbased technologies will revolutionize wholeorganism science. Nat Rev Genet. 2013; 14(9):618–30. Available from: http://www.ncbi.nlm.nih.gov/pubmed/23897237.
 6
Schwartzman O, Tanay A. Singlecell epigenomics: techniques and emerging applications. Nat Rev Genet. 2015; 16(12):716–26. Available from: https://doi.org/10.1038/nrg3980.
 7
Kelsey G, Stegle O, Reik W. Singlecell epigenomics: Recording the past and predicting the future. Science. 2017; 358(6359):69–75. Available from: https://doi.org/10.1017/S0022215115001383.
 8
Smallwood Sa, Lee HJ, Angermueller C, Krueger F, Saadeh H, Peat J, et al. Singlecell genomewide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods. 2014; 11(8):817–20. Available from: http://www.ncbi.nlm.nih.gov/pubmed/25042786.
 9
Guo H, Zhu P, Wu X, Li X, Wen L, Tang F. Singlecell methylome landscapes of mouse embryonic stem cells and early embryos analyzed using reduced representation bisulfite sequencing. Genome Res. 2013:2126–35. Available from: http://www.ncbi.nlm.nih.gov/pubmed/24179143.
 10
Farlik M, Sheffield NC, Nuzzo A, Datlinger P, Schönegger A, Klughammer J, et al. SingleCell DNA Methylome Sequencing and Bioinformatic Inference of Epigenomic CellState Dynamics. Cell Rep. 2015; 10(8):1386–97. Available from: http://www.ncbi.nlm.nih.gov/pubmed/25732828.
 11
Angermueller C, Clark SJ, Lee HJ, Macaulay IC, Teng MJ, Hu TX, et al. Parallel singlecell sequencing links transcriptional and epigenetic heterogeneity. Nat Methods. 2016; 13(3):229–32. Available from: http://www.ncbi.nlm.nih.gov/pubmed/26752769.
 12
Hou Y, Guo H, Cao C, Li X, Hu B, Zhu P, et al. Singlecell triple omics sequencing reveals genetic, epigenetic, and transcriptomic heterogeneity in hepatocellular carcinomas. Cell Res. 2016; 26(3):304–19. Available from: http://www.ncbi.nlm.nih.gov/pubmed/26902283.
 13
Clark SJ, Argelaguet R, Kapourani CA, Stubbs TM, Lee HJ, AldaCatalinas C, et al. ScNMTseq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat Commun. 2018; 9(1):1–9. Available from: https://doi.org/10.1038/s41467018031494. http://www.ncbi.nlm.nih.gov/pubmed/29472610.
 14
Luo C, Keown CL, Kurihara L, Zhou J, He Y, Li J, et al. Singlecell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science. 2017; 357(6351):600–4. Available from: http://www.ncbi.nlm.nih.gov/pubmed/28798132.
 15
Mulqueen RM, Pokholok D, Norberg SJ, Torkenczy KA, Fields AJ, Sun D, et al. Highly scalable generation of DNA methylation profiles in single cells. Nat Biotechnol. 2018; 36(5):428–31. Available from: http://www.ncbi.nlm.nih.gov/pubmed/29644997.
 16
Kapourani CA, Sanguinetti G. Higher order methylation features for clustering and prediction in epigenomic studies. Bioinformatics. 2016; 32(17):i405–12. Available from: http://www.ncbi.nlm.nih.gov/pubmed/27587656. https://doi.org/10.1093/bioinformatics/btw432.
 17
Mayo TR, Schweikert G, Sanguinetti G. M 3 D: A kernelbased test for spatially correlated changes in methylation profiles. Bioinformatics. 2015; 31(6):809–16. Available from: http://www.ncbi.nlm.nih.gov/pubmed/25398611.
 18
Vanderkraats ND, Hiken JF, Decker KF, Edwards JR. Discovering highresolution patterns of differential DNA methylation that correlate with gene expression changes. Nucleic Acids Research. 2013; 41(14):6816–6827. Available from: http://www.ncbi.nlm.nih.gov/pubmed/23748561.
 19
Bishop CM. Pattern recognition and machine learning: Springer; 2006. Available from: http://www.library.wisc.edu/selectedtocs/bg0137.pdf.
 20
Powers DMW. Evaluation: From Precision, Recall and FMeasure to ROC, Informedness, Markedness & Correlation. J Mach Learn Technol. 2011; 2(1):37–63. Available from: https://doi.org/10.1.1.214.9232.
 21
Kapourani CA, Sanguinetti G. BPRMeth: a flexible Bioconductor package for modelling methylation profiles. Bioinformatics (Oxford, England). 2018; 34(14):2485–6. Available from: http://www.ncbi.nlm.nih.gov/pubmed/27587656. https://doi.org/10.1093/bioinformatics/bty129.
 22
Du P, Zhang X, Huang CC, Jafari N, Kibbe WA, Hou L, et al. Comparison of Betavalue and Mvalue methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010; 11(1):587. Available from: http://www.ncbi.nlm.nih.gov/pubmed/21118553.
 23
Zhang W, Spector TD, Deloukas P, Bell JT, Engelhardt BE. Predicting genomewide DNA methylation using methylation marks, genomic position, and DNA regulatory elements. Genome Biol. 2015; 16(1):14. Available from: http://www.ncbi.nlm.nih.gov/pubmed/25616342.
 24
Angermueller C, Lee HJ, Reik W, Stegle O. DeepCpG: accurate prediction of singlecell DNA methylation states using deep learning. Genome Biol. 2017; 18(1):67. Available from: http://www.ncbi.nlm.nih.gov/pubmed/28395661.
 25
Hubert L, Arabie P. Comparing partitions. J Classif. 1985; 2(1):193–218. Available from: https://doi.org/10.1007/BF01908075.
 26
Corduneanu A, Bishop CM. Variational Bayesian Model Selection for Mixture Distributions. In: In Artificial Intelligence and Statistics: 2001. p. 27–34. Available from: https://doi.org/10.1016/j.csda.2006.07.020.
 27
Ficz G, Hore TA, Santos F, Lee HJ, Dean W, Arand J, et al. FGF signaling inhibition in ESCs drives rapid genomewide demethylation to the epigenetic ground state of pluripotency. Cell Stem Cell. 2013; 13(3):351–9. Available from: http://www.ncbi.nlm.nih.gov/pubmed/23850245.
 28
Ernst J, Kellis M. Largescale imputation of epigenomic datasets for systematic annotation of diverse human tissues. Nat Biotechnol. 2015; 33(4):364–76. Available from: https://doi.org/10.1038/nbt.3157.
 29
de Souza CPE, Andronescu M, Masud T, Kabeer F, Biele J, Laks E, et al. Epiclomal: probabilistic clustering of sparse singlecell DNA methylation data. bioRxiv. 2018;414482.
 30
Genereux DP, Johnson WC, Burden AF, Stöger R, Laird CD. Errors in the bisulfite conversion of DNA: modulating inappropriateand failedconversion frequencies. Nucleic Acids Res. 2008; 36(22):e150. Available from: http://www.ncbi.nlm.nih.gov/pubmed/18984622.
 31
Albert JH, Chib S. Bayesian Analysis of Binary and Polychotomous Response Data. J Am Stat Assoc. 1993; 88(422):669–79. Available from: https://doi.org/10.2307/2290350.
 32
McLachlan G, Peel D. Finite mixture models. New York: Wiley; 2004. https://doi.org/10.1002/0471721182.
 33
Gelfand A, Smith AFM. SamplingBased Approaches to Calculating Marginal Densities. J Am Stat Assoc. 1990; 85(410):398–409. Available from: https://doi.org/10.2307/2289776.
 34
Blei DM, Kucukelbir A, McAuliffe JD. Variational Inference: A Review for Statisticians. J Am Stat Assoc. 2017; 112(518):859–77. Available from: https://doi.org/10.1080/01X00000.621459.2017.1285773.
 35
Dunham I, Kundaje A. Encode Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489(7414):57–74. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22955616.
 36
Krueger F, Andrews SR. Bismark: A flexible aligner and methylation caller for BisulfiteSeq applications. Bioinformatics. 2011; 27(11):1571–2. Available from: http://www.ncbi.nlm.nih.gov/pubmed/21493656.
 37
Kapourani CA, Sanguinetti G. Melissa: Bayesian clustering and imputation of single cell methylomes. Github repository:, https://github.com/andreaskapou/Melissa. 2019. Available from: https://doi.org/10.5281/zenodo.2567427.
 38
Kapourani CA, Sanguinetti G. Melissa: Bayesian clustering and imputation of single cell methylomes. Bioconductor repository. 2019. Available from: https://doi.org/doi:10.18129/B9.bioc.Melissa. http://bioconductor.org/packages/Melissa.
Acknowledgements
We thank Duncan Sproul and Jon Higham for discussion and help with bioinformatics pipeline analysis and Oliver Stegle, Michalis Michaelides, Ricard Argelaguet, and Stephen Clark for valuable comments and discussion.
Funding
CAK is a crossdisciplinary postdoctoral fellow supported by funding from the University of Edinburgh, Medical Research Council (core grant to the MRC Institute of Genetics and Molecular Medicine), and the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1).
Author information
Affiliations
Contributions
Both authors conceived the study, carried out the data analysis, and wrote the paper. CAK implemented and evaluated the method. Both authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Ethical approval was not needed for this study.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional information
Availability of data and materials
The Melissa model is publicly available as R software released under the GNU GPL3 licence (Github: https://github.com/andreaskapou/Melissa [37], Bioconductor: http://bioconductor.org/packages/Melissa/ [38], and DOI: https://doi.org/10.5281/zenodo.2567427). The scBSseq data from mouse ESCs from [8] are available under GEO accession number GSE56879. The scM&Tseq data from mouse ESCs from [11] are available under GEO accession number GSE74535.
Additional file
Additional file 1
Melissa meanfield variational inference derivations (section 1), additional figures (section 2), and additional tables (Section 3). (PDF 884 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Kapourani, C., Sanguinetti, G. Melissa: Bayesian clustering and imputation of singlecell methylomes. Genome Biol 20, 61 (2019). https://doi.org/10.1186/s1305901916658
Received:
Accepted:
Published: