A probabilistic generative model for quantification of DNA modifications enables analysis of demethylation pathways

We present a generative model, Lux, to quantify DNA methylation modifications from any combination of bisulfite sequencing approaches, including reduced, oxidative, TET-assisted, chemical-modification assisted, and methylase-assisted bisulfite sequencing data. Lux models all cytosine modifications (C, 5mC, 5hmC, 5fC, and 5caC) simultaneously together with experimental parameters, including bisulfite conversion and oxidation efficiencies, as well as various chemical labeling and protection steps. We show that Lux improves the quantification and comparison of cytosine modification levels and that Lux can process any oxidized methylcytosine sequencing data sets to quantify all cytosine modifications. Analysis of targeted data from Tet2-knockdown embryonic stem cells and T cells during development demonstrates DNA modification quantification at unprecedented detail, quantifies active demethylation pathways and reveals 5hmC localization in putative regulatory regions. Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-0911-6) contains supplementary material, which is available to authorized users.

(a) The model in the plate notation for generating data. Altogether, 20 control cytosines are generated for each of the three cytosine modifications. Observable data is generated for a given condition with predefined methylation pattern distribution (αs) with or without replicates. Each cytosine (control and non-control) has 96 BS-seq and 96 oxBS-seq read-outs. (b) The process of generating (c) and (d). Two different settings are considered: with more (on left) or less (on right) biological variation between replicates. First, the replicate-specific methylation levels and experimental parameters are sampled from beta and Dirichlet distributions, respectively, and then these are used for generating observed data. Given the data, the posterior distribution of methylation proportions is estimated simultaneously together with the experimental parameters. The aforementioned procedure is repeated 100 times. (c) The results when the distribution with more biological variation (on left in (b)) is used to sample replicate-specific methylation levels. To allow a comparison between the ground truth and the estimates the following procedure was used. For each of the 100 simulations, we estimate the posterior mean of the parameter gµ+1 reflecting variability in θs. Using the estimated parameter we define the distribution Dir(gµ+1). The ternary plots show the average of these distributions over 100 random simulations. (d) Same as (c) but here the replicate-specific methylation levels are sampled from the distribution with less biological variation (on right in (b)). (e) The difference between the estimated distributions and the true distributions is quantified using the Kullback-Leibler divergence, i.e., how much information is lost if we use our estimated distribution instead of the true distribution (an asymmetric measure) as a function of the number of replicates. The Kullback-Leibler divergence [2] is calculated between the true distribution with known parameters, Dir(α), and estimated distribution with the posterior mean of gµ+1, Dir(gµ+1). The boxplots are derived from 100 random simulations.

Supplemental Figure 8. Detecting differential methylation using real and in silico data. (a)
The model in the plate notation for generating data. Altogether, 20 control cytosines are generated for each of the three cytosine modifications. Observable data is generated for each of the condition with predefined methylation pattern distribution (αs) with or without replicates.
Each non-control cytosine has 12 BS-seq and 12 oxBS-seq read-outs, whereas each control cytosine has 96 BS-seq and 96 oxBS-seq read-outs. The experimental parameter values are randomly generated from beta distributions using the indicated parameter values. (b) The differential methylation between conditions A and B is detected by quantifying evidence in the data for the alternative hypothesis H 1 : Δµ=µ A -µ B ≠0 over the null hypothesis H 0 : Δµ=µ A -µ B =0 (see Methods). The evidence is measured using the Bayes factor (BF), which is approximated using the Savage-Dickey density ratio p(Δµ=0|H 1 )/p(Δµ=0|H 1 ,D) (see Methods). The evidence for differential methylation (log 10 BF) between a pair of samples (row/ column) is studied as a function of the number of replicates. Outliers are not depicted. (c) A comparison of Lux, FET, and MOABS in detecting differential methylation. For this purpose we down-sampled the full data set to 30X coverage for each of the three replicates. BS-seq and oxBS-seq data sets were analyzed separately with FET and MOABS for differential methylation. All the covered cytosines in CpG context (N=384) were divided into sets of differentially (N=252) and similarly (N=132) methylated cytosines based on independent CMS-IP and MeDIP loci-level information (see 12X, 30X and "full data" using the sequencing data from WT and Tet2 KO mESCs. For comparison with other previous methods these graphs can also be compared directly with Fig.   3b and (c). (e) A comparison of Lux, FET, and MOABS in detecting differential methylation. For this purpose we simulated data from variably differentially methylated (N=100) and similarly methylated cytosines (N=100) with three replicates and variable bisulphite conversion and oxidation efficiencies and controls. For each noncontrol cytosine 12 BS-seq and 12 oxBS-seq read-outs are generated, whereas each control cytosine (20 per cytosine modification) has 96 BS-seq and 96 oxBS-seq read-outs. Then, the goal is to discriminate the differentially methylated cytosines from similarly methylated cytosines. Each cytosine is analyzed separately.
In the FET and MOABS analyses the BS-seq and oxBS-seq data are analyzed separately and no control data is taken into account. (f) As in (b) but here the detection of differential methylation is studied between methylated (p(5mC)=0.97 in v6.5) and unmethylated (p(C)=0.96 in Tetkd) cytosines at chr4:136,547,567 and chr8:120,115,720, respectively. Only the case of one replicate is considered. (g) As in (f) but here the analysis is done for the cytosine at chr4:139,783,857. The effect of the number of reads and replicates on Bayes factor is studied.
The boxplots are derived from 100 randomly subsampled data sets. for locus-based analysis represented with the plate notation. In the plate notation, the grey and white circles are used to represent observed variables and latent variables, respectively. The grey squares represent fixed parameters. Variation in methylation across a locus is modeled hierarchically by first defining a condition specific mean µ for methylation proportions in a locus, and µ is assigned a Dirichlet prior with hyperparameters α=(0.8, 0.8, 0.8). Methylation proportions ν over individual cytosines within a locus are defined to follow Dir(gµ+1) distribution, where g represents biological variation around µ and was given a gamma prior with the shape parameter a=2 and rate parameter b=2/6. The vector 1 is added in order to prevent concentration of the probability mass in a few components. Finally, replicate specific methylation proportions θ are defined to follow Dir(fν +1) distribution, where f represents variation around ν and was given a gamma prior with the shape parameter a=2 and rate parameter b=2/6. The rest of the model is presented in Suppl.  Table 1). That is, for each of the eight hyperhyperparameter related to BS eff , BS* eff , ox eff and seq err (the upper right corner of Suppl.