Melissa model
To provide spatial smoothing of the methylation profiles at specific regions, we adapt a generalized linear model of basis function regression proposed recently [16] and further extended and implemented in the BPRMeth Bioconductor package [21]. The basic idea of BPRMeth is as follows: the methylation profile associated with a genomic region $m$ is defined as a (latent) function $f: m \rightarrow (0, 1)$ which takes as input the genomic coordinate along the region and returns the propensity for that locus to be methylated. For single-cell methylation data, methylation of individual CpG sites can be naturally modeled using a Bernoulli observation model, since for the majority of covered sites we have binary CpG methylation states (see Additional file 1: Figure S13). More specifically, for a specific region $m$, we model the observed methylation of CpG site $i$ as:
$$ y_{mi} \sim \mathcal{B}\text{ern}(\rho_{mi}), $$
(1)
where the unknown “true” methylation level $\rho_{mi}$ has as covariates the CpG locations $x_{mi}$. Then, we define the BPRMeth model as:
$$ \begin{aligned} \eta_{mi} & = \mathbf{w}_{m}^{\top}\mathbf{h}(x_{mi}), \\ f_{m}(x_{mi}) & = \rho_{mi} = g^{-1}(\eta_{mi}), \end{aligned} $$
(2)
where $\mathbf{w}_{m}$ are the regression coefficients, $\mathbf{x}_{mi} \equiv \mathbf{h}(x_{mi})$ are the basis-function-transformed CpG locations (here we consider radial basis functions (RBFs)), and $g(\cdot)$ is the link function that allows us to move from the systematic components $\eta_{mi}$ to the mean parameters $\rho_{mi}$. Here we consider a probit regression model, obtained by defining $g^{-1}(\cdot) = \Phi(\cdot)$, where $\Phi(\cdot)$ denotes the cdf of the standard normal distribution, ensuring that $f$ takes values in the $(0, 1)$ interval. Notice that neither BPRMeth nor Melissa explicitly models bisulfite conversion errors. Conversion errors are estimated to be relatively rare, below 1% [30], and we show in our simulation studies that Melissa is highly robust to the addition of noise mimicking such errors.
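As a concrete illustration, the following R sketch evaluates a probit-linked RBF profile and draws Bernoulli observations from it. This is an illustrative re-implementation of Eqs. (1)-(2), not the BPRMeth package code; the basis centers, the bandwidth `gamma`, and the scaling of locations to $(-1, 1)$ are arbitrary choices made for the example.

```r
# Illustrative sketch of Eqs. (1)-(2): probit regression with RBF bases.
# Not the BPRMeth implementation; centers and gamma are arbitrary here.
rbf_design <- function(x, centers, gamma = 4) {
  H <- sapply(centers, function(c) exp(-gamma * (x - c)^2))
  cbind(1, H)                       # prepend an intercept basis, h(x)
}
x   <- seq(-1, 1, length.out = 50)  # CpG locations scaled to the region
H   <- rbf_design(x, centers = seq(-0.8, 0.8, length.out = 5))
w   <- rnorm(ncol(H))               # regression coefficients w_m
rho <- pnorm(H %*% w)               # f_m(x) = Phi(w' h(x)), values in (0,1)
y   <- rbinom(length(x), 1, rho)    # Bernoulli methylation states (Eq. 1)
```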
To account for the limited CpG coverage of scBS-seq experiments, the BPRMeth model was recently reformulated in a Bayesian framework [21]. The model was made amenable to Bayesian estimation thanks to a data augmentation strategy [31]. This strategy consists of introducing an additional auxiliary latent variable $z_{i}$, which has a Gaussian distribution conditioned on the input $\mathbf{w}^{\top}\mathbf{x}_{i}$, leading to the graphical model in Fig. 6.
The BPRMeth model is limited to sharing information across CpGs via local smoothing (which certainly helps in dealing with data sparsity); however, in our experience, the coverage in scBS-seq data is insufficient to infer informative methylation profiles at many genomic regions. We therefore propose Melissa to exploit the population structure of the experimental design and additionally share and transfer information across cells.
Assume that we have $N$ cells ($n = 1, \ldots, N$) and each cell consists of $M$ genomic regions ($m = 1, \ldots, M$), for example promoters, and we are interested both in partitioning the cells into $K$ clusters and in inferring the methylation profiles for each genomic region. To do so, we use a finite Dirichlet mixture model (FDMM) [32], where we assume that the methylation profile of the $m$th region for each cell $n$ is drawn from a mixture distribution with $K$ components (where $K < N$). This way, cells belonging to the same cluster will share the same methylation profile, although profiles will still differ across genomic regions. Let $\mathbf{c}_{n}$ be a latent variable comprising a 1-of-$K$ binary vector with elements $c_{nk}$ representing the component that is responsible for cell $n$, and $\pi_{k}$ be the probability that a cell belongs to cluster $k$, i.e., $\pi_{k} = p(c_{nk} = 1)$. The conditional distribution of $\mathbf{C} = \{\mathbf{c}_{1}, \ldots, \mathbf{c}_{N}\}$ given $\boldsymbol{\pi}$ is:
$$ p(\mathbf{C} \thinspace | \thinspace \boldsymbol{\pi}) = \prod_{n=1}^{N}\prod_{k=1}^{K}\pi_{k}^{c_{nk}}. $$
(3)
Considering the FDMM as a generative model, the latent variables $\mathbf{c}_{n}$ will generate the latent observations $\mathbf{Z}_{n} \in \mathbb{R}^{M \times I_{m}}$, which in turn will generate the binary observations $\mathbf{Y}_{n} \in \{0, 1\}^{M \times I_{m}}$ depending on the sign of $\mathbf{Z}_{n}$, as shown in Fig. 6. The conditional distribution of the data $(\mathbf{Z}, \mathbf{Y})$, given the latent variables $\mathbf{C}$ and the component parameters $\mathbf{W}$, becomes:
$$ {\begin{aligned} p(\mathbf{Y}, \mathbf{Z} \thinspace | \thinspace \mathbf{C}, \mathbf{W}, \mathbf{X}) & = \prod_{n=1}^{N}\prod_{k=1}^{K}\left[\prod_{m=1}^{M}p(\mathbf{y}_{nm} \thinspace | \thinspace \mathbf{z}_{nm}) \thinspace p(\mathbf{z}_{nm} \thinspace | \thinspace \mathbf{w}_{mk}, \mathbf{X}_{nm})\right]^{c_{nk}}, \end{aligned}} $$
(4)
where
$$p(\mathbf{y}_{nm} \thinspace | \thinspace \mathbf{z}_{nm}) = \mathbb{I}(\mathbf{z}_{nm} > 0)^{\mathbf{y}_{nm}} \mathbb{I}(\mathbf{z}_{nm} \leq 0)^{(\pmb{1} - \mathbf{y}_{nm})}. $$
To complete the model, we introduce priors over the parameters. We choose a Dirichlet distribution over the mixing proportions, $p(\boldsymbol{\pi}) = \mathcal{D}\text{ir}(\boldsymbol{\pi} \,|\, \boldsymbol{\delta}_{0})$, where for symmetry we choose the same parameter $\delta_{0_{k}}$ for each of the mixture components. We also introduce an independent Gaussian prior over the coefficients $\mathbf{W}$, that is:
$$ p(\mathbf{W} \thinspace | \thinspace \boldsymbol{\tau}) = \prod_{m=1}^{M}\prod_{k=1}^{K} \mathcal{N}(\mathbf{w}_{mk} \thinspace | \thinspace \mathbf{0}, \tau_{k}^{-1} \mathbf{I}). $$
(5)
Finally, we introduce a prior distribution for the (hyper)parameter $\boldsymbol{\tau}$ and assume that each cluster has its own precision parameter, $p(\tau_{k}) = \mathcal{G}\text{amma}(\tau_{k} \,|\, \alpha_{0}, \beta_{0})$. Having defined our model, we can now write the joint distribution over the observed and latent variables:
$$ {\begin{aligned} p(\mathbf{Y}, \mathbf{Z}, \mathbf{C}, \mathbf{W}, \boldsymbol{\pi}, \boldsymbol{\tau} \thinspace | \thinspace \mathbf{X}) = & p(\mathbf{Y} \thinspace | \thinspace \mathbf{Z}) \thinspace p(\mathbf{Z} \thinspace | \thinspace \mathbf{C}, \mathbf{W}, \mathbf{X}) \thinspace p(\mathbf{C} \thinspace | \thinspace \boldsymbol{\pi}) \thinspace p(\boldsymbol{\pi}) \thinspace p(\mathbf{W} \thinspace | \thinspace \boldsymbol{\tau}) \thinspace p(\boldsymbol{\tau}), \end{aligned}} $$
(6)
where the factorization corresponds to the probabilistic graphical model shown in Fig. 7, resulting in the following hierarchical model:
$$ \begin{aligned} \boldsymbol{\pi} & \sim \mathcal{D}\text{ir}(\boldsymbol{\delta}_{0}) \\ \mathbf{c}_{n} \thinspace | \thinspace \boldsymbol{\pi} & \sim \mathcal{D}\text{iscrete}(\boldsymbol{\pi}) \\ \tau_{k} & \sim \mathcal{G}\text{amma}(\alpha_{0}, \beta_{0}) \\ \mathbf{w}_{mk} \thinspace | \thinspace \tau_{k} & \sim \mathcal{N}(\mathbf{0}, \tau_{k}^{-1}\mathbf{I}) \\ z_{nmi} \thinspace | \thinspace \mathbf{w}_{mk}, \mathbf{x}_{nmi} & \sim \mathcal{N}(\mathbf{w}_{mk}^{\top}\mathbf{x}_{nmi}, 1) \\ y_{nmi} \thinspace | \thinspace z_{nmi} & =\left\{ \begin{array}{ll} 1 & \; \text{if}~ z_{nmi} > 0\\ 0 & \; \text{if}~ z_{nmi} \leq 0.\\ \end{array} \right. \end{aligned} $$
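To make the hierarchy concrete, the following R sketch draws a small synthetic data set from this generative process. The dimensions, hyperparameter values, and the RBF design are illustrative choices for the example, not the settings used in our experiments.

```r
# Forward simulation of the Melissa hierarchy (illustrative dimensions).
set.seed(1)
N <- 20; M <- 5; K <- 3; I <- 15; D <- 5        # cells, regions, clusters, CpGs, bases
delta0 <- rep(1, K); alpha0 <- 2; beta0 <- 1    # illustrative hyperparameters
g <- rgamma(K, delta0); pi_k <- g / sum(g)      # pi ~ Dir(delta0) via gammas
tau <- rgamma(K, alpha0, beta0)                 # tau_k ~ Gamma(alpha0, beta0)
W <- array(rnorm(M * K * D), c(M, K, D))
for (k in 1:K) W[, k, ] <- W[, k, ] / sqrt(tau[k])  # w_mk ~ N(0, tau_k^{-1} I)
c_n <- sample(K, N, replace = TRUE, prob = pi_k)    # c_n ~ Discrete(pi)
Y <- array(NA_integer_, c(N, M, I))
centers <- seq(-0.8, 0.8, length.out = D - 1)
for (n in 1:N) for (m in 1:M) {
  x <- sort(runif(I, -1, 1))                        # CpG locations in region m
  H <- cbind(1, sapply(centers, function(c) exp(-4 * (x - c)^2)))
  z <- rnorm(I, mean = H %*% W[m, c_n[n], ], sd = 1)  # z ~ N(w' x, 1)
  Y[n, m, ] <- as.integer(z > 0)                      # y = I(z > 0)
}
```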
Importantly, Melissa is a hybrid between a global unsupervised clustering model and a local supervised prediction model, encoded through the GLM regression coefficients $\mathbf{w}$ for each genomic region. When considering Melissa as an imputation (or predictive) model, the training data comprise only a subset of the CpG tuples $(x_{nmi}, y_{nmi})$ for each region. For example, out of the $I_{nm}$ observed CpGs in a given region, Melissa will see only $I_{nm}/2$ randomly chosen CpGs during training, and the remaining CpGs will be used as a held-out test set to evaluate its prediction performance. Note that in either case, whether using all CpGs or only a subset during training, Melissa will additionally perform clustering at the global level, encoded through the latent variables $\mathbf{c}_{n}$.
Variational inference
The posterior distribution of the latent variables given the observed data p(Z,C,W,π,τ | Y,X) for the Melissa model is not analytically tractable; hence, we resort to approximate techniques. The most common method for approximate Bayesian inference is to perform Markov Chain Monte Carlo (MCMC) [33]; however, sampling methods require considerable computational resources and do not scale well when performing genome-wide analysis on hundreds or thousands of single cells. Variational methods can provide an efficient, approximate solution with better scalability in this case (see the “Results” section for a comparison between Gibbs sampling and variational inference for this model). More specifically, we use mean-field variational inference [34] which assumes that the approximating distribution factorizes over the latent variables:
$$ q(\mathbf{Z}, \mathbf{C}, \mathbf{W}, \boldsymbol{\pi}, \boldsymbol{\tau}) = q(\mathbf{Z})\thinspace q(\mathbf{C}) \thinspace q(\mathbf{W}) \thinspace q(\boldsymbol{\pi}) \thinspace q(\boldsymbol{\tau}). $$
(7)
Detailed mathematical derivations of the optimal variational factors are available in Additional file 1: Section 1. We then iteratively update each factor $q$ while holding the remaining factors fixed, using the coordinate ascent variational inference (CAVI) algorithm, which is summarized in Algorithm 1.
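While the full updates over $(\mathbf{Z}, \mathbf{C}, \mathbf{W}, \boldsymbol{\pi}, \boldsymbol{\tau})$ are given in the supplement, the following self-contained R sketch conveys the flavour of the coordinate updates for a single region without clustering: a minimal Bayesian probit model with a fixed prior precision `tau0`. The fixed precision and the simulated data are simplifications made purely for brevity; this is not the Melissa implementation.

```r
# Minimal CAVI sketch: mean-field VB for a single-region Bayesian probit
# model with fixed prior precision tau0 (brevity assumption); the full
# Melissa updates over Z, C, W, pi, tau are derived in Additional file 1.
set.seed(1)
n <- 100; d <- 4
X <- cbind(1, matrix(rnorm(n * (d - 1)), n))
y <- as.integer(rnorm(n, X %*% rnorm(d)) > 0)    # simulated binary states

tau0 <- 1
S    <- solve(tau0 * diag(d) + crossprod(X))     # q(w) covariance (constant)
mu_w <- rep(0, d)
for (iter in 1:100) {
  eta <- as.vector(X %*% mu_w)
  # E[z_i] under q(z_i): a Gaussian truncated to the sign dictated by y_i
  Ez <- ifelse(y == 1,
               eta + dnorm(eta) / pnorm(eta),
               eta - dnorm(eta) / pnorm(-eta))
  mu_w <- as.vector(S %*% crossprod(X, Ez))      # q(w) mean update
}
```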
Predictive density and model selection
Given an approximate posterior distribution, we are in a position to predict the methylation level at unobserved CpG sites. The predictive density of a new observation $\mathbf{y}_{*}$, which is associated with latent variables $\mathbf{c}_{*}$, $\mathbf{z}_{*}$ and covariates $\mathbf{X}_{*}$, is given by:
$$ {\begin{aligned} p(\mathbf{y}_{*} \thinspace | \thinspace \mathbf{X}_{*},\mathbf{Y}) & = \sum_{c_{*}}\int\int p(\mathbf{y}_{*}, \mathbf{c}_{*}, \mathbf{z}_{*}, \boldsymbol{\theta} \thinspace | \thinspace \mathbf{X}_{*},\mathbf{Y}) d\boldsymbol{\theta}d\mathbf{z}_{*} \\ & \simeq \sum_{k=1}^{K}\frac{\delta_{k}}{\sum_{j}\delta_{j} }\mathcal{B}\text{ern}\left(\mathbf{y}_{*} \Big{|} \Phi \left(\frac{\mathbf{X}_{*}\boldsymbol{\lambda}_{k}}{\sqrt{\mathbf{I} + \text{diag}\big(\mathbf{X}_{*} \mathbf{S}_{k} \mathbf{X}_{*}^{T}\big)}}\right)\right) \end{aligned}} $$
(8)
where we collectively denote by $\boldsymbol{\theta}$ the relevant parameters being marginalized.
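A direct R transcription of the approximation in Eq. (8) might look as follows; `delta`, `lambda`, and `S` denote the variational Dirichlet parameters and the per-cluster posterior means and covariances over the weights (argument names follow Eq. (8); this is a sketch, not the package implementation).

```r
# Approximate predictive methylation probabilities of Eq. (8) for one region.
# Xstar: design matrix at new CpG locations; delta: Dirichlet parameters;
# lambda, S: lists of per-cluster posterior means / covariances over weights.
predict_melissa <- function(Xstar, delta, lambda, S) {
  pi_hat <- delta / sum(delta)
  p <- rep(0, nrow(Xstar))
  for (k in seq_along(delta)) {
    denom <- sqrt(1 + diag(Xstar %*% S[[k]] %*% t(Xstar)))  # probit correction
    p <- p + pi_hat[k] * pnorm(as.vector(Xstar %*% lambda[[k]]) / denom)
  }
  p   # predicted methylation probabilities at the new CpG locations
}
```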
It has been repeatedly observed [26] that, when a mixture model with a large number of components is fitted variationally, the procedure will prune away components with no support in the data, hence effectively determining an appropriate number of clusters in an automatic fashion, i.e., performing model selection. We can gain some intuition as to why this happens by rewriting the Kullback-Leibler ($\mathcal{KL}$) divergence as:
$$ {\begin{aligned} \mathcal{KL}(q(\boldsymbol{\theta})\;||\;p(\boldsymbol{\theta} \thinspace | \thinspace \mathbf{X})) = & \ln p(\mathbf{X}) - \left\langle \ln p(\mathbf{X} \thinspace | \thinspace \boldsymbol{\theta}) \right\rangle_{q(\boldsymbol{\theta})} + \mathcal{KL}(q(\boldsymbol{\theta})\;||\;p(\boldsymbol{\theta})) \end{aligned}} $$
(9)
where $\ln p(\mathbf{X})$ can be ignored since it is constant with respect to $q(\boldsymbol{\theta})$. To minimize this objective function, the variational approximation will try to increase the expected log likelihood of the data $\left\langle \ln p(\mathbf{X} \,|\, \boldsymbol{\theta}) \right\rangle_{q(\boldsymbol{\theta})}$ while minimizing its $\mathcal{KL}$ divergence from the prior distribution $p(\boldsymbol{\theta})$. Hence, using variational Bayes, we obtain an automatic trade-off between fitting the data and penalizing model complexity [19].
Assessing Melissa via a simulation study
To generate realistic simulated single-cell methylation data, we first used the BPRMeth package [21] to infer five prototypical methylation profiles from the GM12878 lymphoblastoid cell line. The bulk BS-seq data for the GM12878 cell line are publicly available from the ENCODE project [35]. Based on these profiles, we simulated single-cell methylation data (i.e., binary CpG methylation states) for $M = 100$ genomic regions, where each CpG was generated by sampling from a Bernoulli distribution with probability of success given by the latent function evaluation at the specific site. To mimic the inherent noise introduced by bisulfite conversion errors, Gaussian noise $\mathcal{N}(\mu = 0, \sigma = 0.05)$ was added to the probability of success prior to generating each binary CpG state. This process can be thought of as generating methylation data for a specific single cell. Next, we generated $K = 4$ cell sub-populations by randomly shuffling the genomic regions across clusters, so that each cell sub-population has its own methylome landscape. In total, we generated $N = 200$ cells, with the following cell sub-population proportions: 40%, 25%, 20%, and 15%. Additionally, to account for different levels of similarity between cell sub-populations, we simulated 11 different data sets by varying the proportion of similar genomic regions between clusters. Finally, to assess the performance of Melissa for a varying number of cells assayed, we simulated 10 different data sets by varying the total number of single cells $N$. The scripts (written in the R programming language) for this simulation study are publicly available in the Melissa repository.
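The per-cell, per-region simulation step can be sketched in R as follows. The sigmoidal profile and the clamping of probabilities into $(0, 1)$ are illustrative assumptions made here; the actual profiles were inferred from the GM12878 bulk data as described above.

```r
# One simulated region: evaluate a latent profile, add Gaussian noise to the
# success probability (mimicking conversion errors), then draw binary states.
simulate_region <- function(f, n_cpgs = 30, sd_noise = 0.05) {
  x   <- sort(runif(n_cpgs, -1, 1))            # CpG locations in the region
  rho <- f(x) + rnorm(n_cpgs, 0, sd_noise)
  rho <- pmin(pmax(rho, 1e-3), 1 - 1e-3)       # clamp to (0,1); our assumption
  cbind(location = x, met_state = rbinom(n_cpgs, 1, rho))
}
region <- simulate_region(function(x) pnorm(3 * x))  # toy stand-in profile
```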
Assessing Melissa on sub-sampled bulk ENCODE data
To faithfully simulate methylation data that resemble scBS-seq experiments, we generated two additional synthetic data sets by sub-sampling bulk ENCODE RRBS (GEO: GSE27584) and WGBS (GEO: GSE80911 for H1-hESC and GSE86765 for GM12878) data, each consisting of two different cell lines, H1-hESC and GM12878. The RRBS data are enriched for genomic regions with high CpG content (through methylation-insensitive restriction enzymes such as MspI, which recognizes CCGG motifs), which predominantly reside near promoter regions and CpG islands. On the other hand, WGBS experiments can in theory assay the whole methylome landscape of the human genome; however, they require high sequencing depth to obtain an accurate estimate of the bulk methylation level at each CpG site. To retain the structure of missing data observed in scBS-seq experiments (due to read length), we directly sub-sampled the raw FASTQ files, which essentially amounts to discarding individual reads rather than individual CpGs. For the RRBS data set, from each cell line, we generated 40 pseudo-single cells by randomly keeping 10% of the mapped reads from the bulk experiment, resulting in 80 cells when combining both cell lines. For the WGBS data set, the same number of pseudo-single cells was generated from each cell line, with the only difference that only 0.5% of the mapped reads were retained from the bulk data, owing to the high sequencing depth of these experiments. This process was performed for chromosomes 1 to 6 to alleviate the computational burden. Subsequently, the same preprocessing steps detailed in the previous section were performed, with the only difference that for this study we considered only ± 2.5 kb and ± 5 kb promoter regions around the TSS. Each model, except DeepCpG, used 20%, 50%, and 80% of the CpGs as the training set, with the remaining CpGs used as a test set to evaluate imputation performance. The DeepCpG model used chromosomes 1 and 3 as the training set, chromosome 5 as the validation set, and the remaining chromosomes as the test set.
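As an indication of how such read-level sub-sampling can be done, the following hedged R sketch streams a FASTQ file with the Bioconductor ShortRead package and keeps each read independently with the target probability; the file names are placeholders, and the exact tooling used for our sub-sampling is not prescribed by this sketch.

```r
# Stream a bulk FASTQ file and keep each read with probability p,
# producing one pseudo-single-cell FASTQ (file names are placeholders).
library(ShortRead)
p  <- 0.10                                     # 10% for RRBS; 0.5% for WGBS
fs <- FastqStreamer("bulk.fastq.gz", n = 1e6)  # 1M-record chunks
repeat {
  chunk <- yield(fs)
  if (length(chunk) == 0) break
  keep <- runif(length(chunk)) < p
  writeFastq(chunk[keep], "pseudo_cell_01.fastq.gz", mode = "a")
}
close(fs)
```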
scBS-seq data and preprocessing
Single-cell bisulfite sequencing protocols provide us with single base-pair resolution of CpG methylation states. Since we assay the DNA of a single cell, the methylation level for each CpG site is predominantly binary, either methylated or unmethylated. However, because each cell carries two copies of each chromosome, a small proportion of CpG sites have a non-binary nature (see Additional file 1: Figure S19). To avoid ambiguities, hemi-methylated sites (sites with a 50% methylation level) are filtered out prior to downstream analysis, and for the remaining sites, binary methylation states are obtained from the ratio of methylated read counts to total read counts [11].
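With per-CpG counts of methylated and total reads, this filtering and binarization step amounts to the following R snippet (the variable names and toy counts are illustrative):

```r
# Toy per-CpG counts: methylated reads and total reads covering each site.
met_reads   <- c(0, 3, 1, 4, 2)
total_reads <- c(2, 3, 2, 4, 4)
rate  <- met_reads / total_reads
keep  <- rate != 0.5                   # drop hemi-methylated (50%) sites
state <- as.integer(rate[keep] > 0.5)  # binary state for the remaining CpGs
```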
Two mouse embryonic stem cell (ESC) data sets were used to validate the performance of the Melissa model. The scM&T-seq data set [11], after quality assessment, consisted of 75 single cells, of which 14 cells were cultured in 2i medium (2i ESCs) and the remaining 61 cells were cultured in serum conditions (serum ESCs). The Bismark [36] processed data, with reads mapped to the GRCm38 mouse genome, were downloaded from the Gene Expression Omnibus under accession GSE74535. The scBS-seq data set [8] contained 32 cells, of which 12 cells were 2i ESCs and the remaining 20 cells were serum ESCs; the Bismark-processed data, with reads mapped to the GRCm38 mouse genome, are publicly available under accession number GSE56879. For both data sets, the observed data used as input to Melissa are binary methylation states: unmethylated CpGs are encoded as zero and methylated CpGs as one. We should note that this is the standard procedure for processing scBS-seq data [8], and additional information and visualizations regarding the quality of the scBS-seq data can be found in the original publications.
Since Melissa considers genomic regions for a specific genomic context, we use the BPRMeth package [21] to filter out CpGs that do not fall inside these regions and create a simple data structure where each cell is encoded as a list, and each entry of the list, corresponding to a specific genomic region, is a matrix with two columns: the (relative) CpG location and the methylation state. We considered six different genomic contexts in which we applied Melissa: protein-coding promoters with varying genomic windows (± 1.5 kb, ± 2.5 kb, and ± 5 kb around transcription start sites (TSS)), active enhancers, super enhancers, and Nanog regulatory regions. Due to the sparse CpG coverage, for the three genomic contexts other than promoters, we filtered out loci with an annotation length smaller than 1 kb, and specifically for Nanog regions, we took a window of ± 2.5 kb around the center of the genomic annotation. In addition, we only considered regions that were covered in at least 50% of the cells with a minimum coverage of 10 CpGs and that showed between-cell variability; the rationale being that regions that are homogeneous across cells do not provide additional information for identifying cell sub-populations. The CpG coverage distribution after the filtering process across different genomic contexts is shown in Additional file 1: Figures S20 and S21. The sparsity level of the two scBS-seq data sets across different genomic contexts is shown in Additional file 1: Table S3. It should be noted that imputation performance is evaluated only on genomic regions that pass the filtering threshold. We ran the model with $K = 6$ and $K = 5$ clusters for the scM&T-seq and scBS-seq data sets, respectively, and we used a broad prior over the model parameters.
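The resulting per-cell data structure can be pictured with the following toy R snippet; the region count, CpG counts, and coordinates are invented purely for illustration.

```r
# Toy version of the Melissa input: one cell = a list over M regions,
# each region a two-column matrix of (relative CpG location, binary state).
M <- 3
cell <- lapply(seq_len(M), function(m) {
  n_cpgs <- sample(10:30, 1)
  cbind(location  = sort(runif(n_cpgs, -1, 1)),
        met_state = rbinom(n_cpgs, 1, 0.5))
})
str(cell, max.level = 1)
```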
Performance evaluation
To assess model performance across all genomic contexts, we partition the data and use 50% of the CpGs in each cell and region as the training set and the remaining 50% as the test set (except for DeepCpG, see below). The prediction performance of all competing models, except DeepCpG, was evaluated on imputing all missing CpG states in a given region at once. For computing binary evaluation metrics, such as the F-measure, predicted probabilities above 0.5 were set to one, and to zero otherwise.
F-measure The F-measure or F1-score is the harmonic mean of precision and recall:
$$ F\text{-measure} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. $$
(10)
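In R, with the 0.5 threshold described above, the computation is a straightforward transcription of Eq. (10):

```r
# F-measure from true binary states and predicted probabilities (Eq. 10).
f_measure <- function(y_true, y_prob) {
  y_pred    <- as.integer(y_prob > 0.5)        # threshold at 0.5
  tp        <- sum(y_pred == 1 & y_true == 1)
  precision <- tp / sum(y_pred == 1)
  recall    <- tp / sum(y_true == 1)
  2 * precision * recall / (precision + recall)
}
```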
Gaussian mixture model The input to the Gaussian mixture model (GMM) is the average methylation rate across the region; since rates lie in (0,1), we transform them to M values, which more closely follow a Gaussian distribution [22]. The transformation from average methylation rates to average M values is given by:
$$ M\text{value} = \log_{2}\left(\frac{\text{rate} + 0.01}{1 - \text{rate} + 0.01}\right). $$
(11)
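The transform of Eq. (11), followed by a GMM fit on the resulting M values, can be sketched as below. The mclust package is one standard R implementation of GMMs, used here purely for illustration; the toy rate matrix is invented for the example.

```r
# M-value transform (Eq. 11) and a GMM fit on the transformed rates.
m_value <- function(rate) log2((rate + 0.01) / (1 - rate + 0.01))
library(mclust)                                # one standard GMM package
rates <- matrix(runif(200 * 20), nrow = 200)   # toy cells x regions rates
fit   <- Mclust(m_value(rates), G = 4)         # cluster cells into 4 groups
table(fit$classification)
```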
Adjusted Rand Index The Adjusted Rand Index (ARI) is a measure of the similarity between two data clusterings:
$$ ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_{i} \binom{\alpha_{i}}{2}\sum_{j}\binom{\beta_{j}}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_{i} \binom{\alpha_{i}}{2} + \sum_{j}\binom{\beta_{j}}{2}\right] - \left[\sum_{i} \binom{\alpha_{i}}{2}\sum_{j}\binom{\beta_{j}}{2} \right] / \binom{n}{2}}. $$
(12)
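A direct R implementation of Eq. (12) builds the contingency table of the two labelings, with $n_{ij}$ its entries and $\alpha_{i}$, $\beta_{j}$ its row and column sums:

```r
# Adjusted Rand Index (Eq. 12); n_ij from the contingency table, alpha_i
# and beta_j from its row and column sums.
ari <- function(labels_a, labels_b) {
  tab      <- table(labels_a, labels_b)
  n        <- sum(tab)
  sum_ij   <- sum(choose(tab, 2))
  sum_a    <- sum(choose(rowSums(tab), 2))
  sum_b    <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(n, 2)
  (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
}
```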
DeepCpG
The DeepCpG method takes a different imputation approach: it is trained on a specific set of chromosomes and predicts methylation states on the remaining chromosomes, imputing each CpG site sequentially using a set of neighboring CpG sites as input. This approach makes a like-for-like comparison with the rival methods difficult, since for each CpG the input features to DeepCpG are all of the neighboring sites, whereas the competing models have access to a subset of the data and make predictions in one pass for the whole region. Since we only had access to CpG methylation data, and to make DeepCpG comparable with the considered methods, we trained its CpG module (termed DeepCpG CpG in [24]).
For the scM&T-seq data set, chromosomes 3 and 17 were used as the training set, chromosomes 12 and 14 as the validation set, and the remaining chromosomes as the test set. For the scBS-seq data set, chromosomes 3, 17, and 19 were used as the training set; chromosomes 12 and 14 as the validation set; and the remaining chromosomes as the test set. The chosen chromosomes provided at least three million CpGs for training, a sensible size for the DeepCpG model as suggested by its authors. A neighborhood of K = 20 CpG sites to the left and to the right of each target CpG was used as input to the model. At test time, even if a given genomic region did not contain at least 40 CpGs, the DeepCpG model used additional CpGs outside this window to predict methylation states, hence using more information than the rival models. In total, the DeepCpG model took around 4 days per data set for training and prediction on a cluster equipped with NVIDIA Tesla K40m GPUs.