BayMeth: improved DNA methylation quantification for affinity capture sequencing data using a flexible Bayesian approach
 Andrea Riebler^{1, 2, 3}Email author,
 Mirco Menigatti^{4},
 Jenny Z Song^{5},
 Aaron L Statham^{5},
 Clare Stirzaker^{5, 6},
 Nadiya Mahmud^{7},
 Charles A Mein^{7},
 Susan J Clark^{5, 6} and
 Mark D Robinson^{1, 8}Email author
DOI: 10.1186/gb2014152r35
© Riebler et al.; licensee BioMed Central Ltd. 2014
Received: 11 November 2013
Accepted: 11 February 2014
Published: 11 February 2014
Abstract
Affinity capture of DNA methylation combined with highthroughput sequencing strikes a good balance between the high cost of whole genome bisulfite sequencing and the low coverage of methylation arrays. We present BayMeth, an empirical Bayes approach that uses a fully methylated control sample to transform observed read counts into regional methylation levels. In our model, inefficient capture can readily be distinguished from low methylation levels. BayMeth improves on existing methods, allows explicit modeling of copy number variation, and offers computationally efficient analytical mean and variance estimators. BayMeth is available in the Repitools Bioconductor package.
Background
DNA methylation (DNAme) is a critical component in the regulation of gene expression, is precisely controlled in development and is known to be aberrantly distributed in many diseases, such as cancer and diabetes [1, 2]. In differentiated cells, DNAme occurs primarily in the CpG dinucleotide context. For CpGislandassociated promoters, increases in DNAme (i.e. hypermethylation) induce repression of transcription, while hypomethylated promoters may be transcriptionally active. In cancer, tumor suppressor gene promoters are frequently hypermethylated, and therefore silenced, while hypomethylation can activate oncogenes, which collectively can drive disease progression [3, 4]. The detection and profiling of such abnormalities across cell types and patient cohorts is of great medical relevance, both for our basic understanding of how the disease manifests but also for the opportunities of translating this knowledge to the clinic [5]. Epigenetic patterns can be used as diagnostic markers, predictors of response to chemotherapy and for understanding mechanisms of disease progression [6–9]. Acquired epigenetic changes are potentially reversible, which provides important therapeutic opportunities; notably, the US Food and Drug Administration has approved at least four epigenetic drugs and others are in latestage clinical trials [8].
Four classes of methods are available to profile DNAme genomewide: chemical conversion, endonuclease digestion, direct sequencing and affinity enrichment. Combinations of techniques are also in use, such as reduced representation bisulfite sequencing (RRBS) [10]. For recent reviews of the available platforms, see [11–13]. Treatment of DNA with sodium bisulfite is the gold standard, giving a singlebase readout that preserves methylated cytosines while unmethylated cytosines are converted to uracil [14]. This approach can be coupled with highthroughput sequencing, e.g. whole genome bisulfite sequencing (WGBS), or a ‘genotyping’ microarray (e.g. Illumina Human Methylation 450k array, San Diego, USA [15]). Because WGBS is genomewide, it inefficiently reveals the methylation status for low CpG density regions [16] and is costlimiting for larger cohorts; however, recent statistical frameworks allow coverage to be traded for replication [17] and sequencing targeted regions may be a plausible way to increase efficiency [18, 19]. Meanwhile, Illumina arrays cover less than 2% of genomic CpG sites and are only available for profiling human DNA, while enzymatic digestion approaches are limited by the location of specific sequences. There is considerable excitement surrounding thirdgeneration sequencing technologies that directly infer methylation status, but these are not yet readily available and generally offer lower throughput [20, 21].
An attractive alternative that provides a good tradeoff between cost and coverage, albeit at lower resolution, is affinity capture of methylated DNA in combination with highthroughput sequencing (e.g. methylated DNA immunoprecipitation sequencing (MeDIPseq) [6, 22]). Using affinity capture with antibodies to 5methylcytosine or methylCpG binding domainbased (MBD) proteins, subpopulations of methylated DNA are captured, prepared, sequenced and mapped to a reference genome (see Laird [11]). Åberg et al. [23] studied the use of MBD sequencing (MBDseq) for methylomewide association studies with 1,500 case–control samples, and proved the potential of MBDseq as a costeffective tool in largescale disease studies. A recent comparative study highlighted that affinity capture methods can uncover a significantly larger fraction of differentially methylated regions than the Illumina 450k array [24]. With appropriate normalization, the density of mapped reads can be transformed to a quantitative readout of the regional methylation level. However, the capability of these procedures to interrogate a given genomic region is largely related to CpG density, which influences the efficiency of capture and can differ from protocol to protocol [16, 25, 26]. Thus, statistical approaches are needed.
Several methods have been proposed to estimate DNAme from affinitybased DNAme data. For example, MBDisolated genome sequencing, a variant of MBDseq, assumes a constant rate of reads genomewide and uses a single threshold to binarize as methylated or not [27]. Stateoftheart methods, such as Batman [22] and MEDIPS [28], build a linear model relating read density and CpG density, which is then used to normalize the observed read densities. For MeDIPseq data, the algorithms had similar estimation performance [28], though MEDIPS was considerably more timeefficient. A new tool called BALM uses deep sequencing of MBDcaptured populations and a biasymmetricLaplace model to provide CpGspecific methylation estimates [29]. All methods, however, suffer from the same limitations: low capture efficiency cannot easily be distinguished from low methylation level; and, other factors that directly affect read density, such as copy number variation (CNV), are not easily taken into account. For CNV correction, a few possibilities have emerged, such as omitting known regions of amplification [6], adjusting read densities manually [30] and adjusting using the read density from an input sample [29]. Very recently, a method based on combining profiles from MeDIP/MBDseq and methylationsensitive restriction enzyme sequencing for the same samples with a computational approach using conditional random fields appears promising [31].
We present a novel empirical Bayes model called BayMeth, based on the Poisson distribution, that explicitly models (affinity capture) read densities of a fully methylated control (e.g. DNA treated with SssI CpG methyltransferase) together with those from a sample of interest. Here, SssI data provide the model an awareness of where in the genome the assay can detect DNAme and the model allows integration of CNV and potentially other estimable factors that affect read density. We have derived an analytic expression for the mean methylation level and also for the variance. Interval estimates, such as credible intervals, can be computed using numerical integration of the analytical posterior marginal distributions. Using MBDseq for human lung fibroblast (IMR90) DNA, where ‘true’ methylation levels are available from WGBS, we found favorable performance compared to existing approaches in terms of bias, meansquared error, Spearman correlation and coverage probabilities. We found that improved performance can even be observed when ignoring SssI data. Modelbased SssI correction, however, does not only lead to better performance, but, in addition, data originating from different capture platforms can be compared more easily by propagating the platformspecific uncertainty. Using MBDseq data for human prostate carcinoma (LNCaP) cells, we showed that directly integrating CNV data provides additional performance gains. The performance with historical data, where no matched SssI sample is available, was demonstrated using data for embryonic stem cells, and colon tumor and normal samples from [32].
Results and discussion
BayMeth: A Bayesian framework for translating read densities into methylation levels
with λ_{ i } > 0 and 0 < μ_{ i } < 1. Here, λ_{ i } denotes the regionspecific read density at full methylation, μ_{ i } the regional methylation level and f > 0 represents the (effective) relative sequencing depth between libraries (i.e. a normalization offset). An approximately linear relation between the copy number state and MBDseq read density has been established [33]. Hence, if needed, we include a multiplicative offset cn_{ i }/ccn in our model formulation, where cn_{ i } denotes the copy number state at region i and ccn is a cell’s most prominent CNV state (e.g. two in normal cells).
Closedform posterior methylation quantities
where _{2}F_{1}() is the Gauss hypergeometric function (see page 558 of [35]). The posterior mean and the variance are analytically available (see Additional file 1) and therefore efficient to compute. Credible intervals, which are quantilebased or use the highest posterior density (HPD), can be computed from equation (3). Wald credible intervals are computed on the logit scale, where logit(μ_{ i }) = log(μ_{ i } / (1  μ_{ i })), and then transformed back. These intervals are based on assuming asymptotic normality of the logit methylation estimate. The 95% Wald interval on the logit scale is computed from $logit\left({\widehat{\mu}}_{i}\right)\pm 1.96\xb7{\widehat{\sigma}}_{i}$, where ${\widehat{\sigma}}_{i}$ is the standard error estimate of $\text{logit}\left({\widehat{\mu}}_{i}\right)$. For detailed statistical derivations, also including more general prior distributions for μ_{ i }, refer to Additional file 1.
Empirical Bayes for prior hyperparameter specification
Our method takes advantage of the relation between CpG density and read depth to formulate a CpGdensitydependent prior distribution for λ_{ i } (and possibly unknown parameters in the prior distribution of μ_{ i }). Taking CpG density into account, the prior should stabilize the methylation estimation procedure for low counts and in the presence of sampling variability. All unknown hyperparameters are determined in a CpGdensitydependent manner using empirical Bayes. For each genomic bin of a predetermined size, e.g., 100 bp, we determine the weighted number of CpG dinucleotides within an enlarged window, say 700 bp, around the center of the bin (see Materials and methods and MEDME [36]). Each region is classified based on its CpG density into one of K( = 100) nonoverlapping CpG density intervals (see xaxis tick marks in Additional file 2: Figure S1).
SssIfree BayMeth
Although we recommend collecting at least a single SssI sample under the same protocol as the data of interest, BayMeth can, in principle, be run without a SssIcontrol sample. The statistical framework then only involves the Poisson model for the sample of interest (equation (1)) and no longer borrows strength from the information included in the SssIcontrol sample (equation (2)). The same model is used in the analysis of underreported count data in economics [34, 38, 39], where it is assumed that the number of registered purchase events underreports the actual purchase rate. According to Fader and Hardie [34], the parameters λ_{ i } and μ_{ i } are identifiable assuming that the gamma and beta prior distributions are able to capture unobserved heterogeneity in the read density rate and the methylation level. As in the framework with SssI data, parameters for the gamma prior distributions of the regionspecific read density λ_{ i } can still be determined in a CpGdensitydependent manner using empirical Bayes; however, no information can be borrowed from the fully methylated control. Furthermore, the determination of the normalizing offset f is more involved. Interpretation moves from the (effective) relative sequencing depth between libraries to the number of bins potentially ‘at risk’ of being methylated in the sample of interest. Here, we fix f at the 99% quantile of the number of reads. The results for the posterior mean and variance of the methylation level change accordingly (see Additional file 1).
Analysis of affinity capture methylation data with a matched SssI sample
For the following, we used BayMeth to affinity capture methylation data. We collected a SssIcontrol sample under the same conditions (e.g. same elution protocol) used for the samples of interest. Hence, both data components are matched.
BayMeth improves estimation and provides realistic variability estimates
To take advantage of the singlebaseresolution highcoverage methylome obtained using WGBS by Lister et al. [40], we generated IMR90 MBDseq data under the same protocol as our previously published SssI MBDseq dataset [37], i.e. using a single fraction with a high salt elution buffer (MethylMiner™). We applied BayMeth to chromosome 7, which consists of 1,588,214 nonoverlapping bins of width 100 bp. Only bins with at least 75% mappable bases were included, so we analyzed 1,221,753 bins (approximately 77%).
We ran BayMeth in two configurations: (1) incorporating SssI information and assuming a uniform prior between zero and one for the methylation parameter and (2) ignoring SssI information and assuming a DiracBetaDirac mixture prior distribution for the methylation parameter. That means we set a point mass on zero and on one, giving each a prior weight of 10%. The parameters of the central beta component were assumed to be unknown. The normalizing offset f = 0.581 for configuration 1 was found by calculating a scaling factor between highly methylated regions in IMR90 relative to the SssI control (see Materials and methods and Additional file 2: Figure S2). The prior parameters for the gamma distributions and the parameters of the beta distribution in configuration 2 were determined by empirical Bayes, as discussed above (see also further details in Materials and methods).
We compared the results from BayMeth, both ignoring and taking advantage of the SssI control, to those obtained from Batman [22], MEDIPS [28] and BALM [29]. To provide plausible uncertainty estimates with Batman, we increased the default number of generated samples from 100 to 500. The WGBS data, here considered to be the ‘truth’ (at suitable depth), and the CpGspecific BALM methylation estimates were collapsed into 100bp bin estimates (see Materials and methods) to match the estimates from MEDIPS, Batman and our approach. For about 53% (645,451) of the analyzed bins, no WGBS data were available (largely due to the lack of CpG sites). For 17,259 bins, no methylation estimates were provided by Batman, so that in total, algorithm comparisons were conducted on the remaining 559,043 bins.
Performance assessment for IMR90 analysis (chromosome 7)
SssI depth  Number of bins  Method  Mean bias  Mean of squared differences  Spearman correlation  Wald  Highest posterior density  Quantile 

[ 0,4]  305,638  BayMeth  0.04  0.08  0.36  0.74  0.89  0.89 
BayMeth (SssIfree)  0.19  0.20  0.23  —  —  0.24  
Batman  0.22  0.14  0.31  —  —  0.43  
MEDIPS  0.38  0.26  0.29  —  —  —  
BALM  0.48  0.33  0.32  —  —  —  
(4,7]  22,196  BayMeth  0.05  0.05  0.65  0.84  0.88  0.87 
BayMeth (SssIfree)  0.01  0.08  0.42  —  —  0.68  
Batman  0.16  0.07  0.61  —  —  0.34  
MEDIPS  0.23  0.11  0.45  —  —  —  
BALM  0.27  0.15  0.60  —  —  —  
(7,14]  28,871  BayMeth  0.06  0.04  0.69  0.84  0.86  0.86 
BayMeth (SssIfree)  0.02  0.05  0.57  —  —  0.79  
Batman  0.16  0.07  0.65  —  —  0.28  
MEDIPS  0.21  0.10  0.49  —  —  —  
BALM  0.21  0.11  0.66  —  —  —  
(14,27]  28,928  BayMeth  0.05  0.03  0.76  0.81  0.85  0.82 
BayMeth (SssIfree)  0.08  0.04  0.72  —  —  0.70  
Batman  0.15  0.06  0.73  —  —  0.23  
MEDIPS  0.20  0.09  0.59  —  —  —  
BALM  0.15  0.07  0.75  —  —  —  
(27,168]  28,719  BayMeth  0.02  0.03  0.79  0.73  0.86  0.78 
BayMeth (SssIfree)  0.11  0.04  0.77  —  —  0.48  
Batman  0.11  0.05  0.75  —  —  0.20  
MEDIPS  0.22  0.10  0.67  —  —  —  
BALM  0.14  0.06  0.76  —  —  — 
For the same stratification, Additional file 2: Table S1 shows the mean bias for BayMeth, Batman, MEDIPS and BALM. While the latter two had a low mean bias for bins where the truth was within [ 0,0.2], Batman performed best for highly methylated bins. BayMeth performed well for bins where the true methylation level was intermediate or high. Like Batman, reasonable estimates were obtained over the whole range of methylation states when considering bins in CpG islands. When interpreting the mean bias, the uncertainty around the obtained estimates should be taken into account and hence the results should be set into context with Figure 4. Combining bias and calibration, BayMeth performed well and seems to be better than the existing approaches.
Copy number variationaware BayMeth improves estimation of DNA methylation for prostate cancer cells
Copy number specific offset
1  2  3  4  5  6  7  8  

Combined offset  0.178  0.356  0.534  0.712  0.889  1.067  1.245  1.423 
Performance assessment for LNCaP analysis by copy number
Copy number  Number of bins  Method  Mean bias  Mean of squared differences  Spearman correlation 

2  18,010  BayMeth  0.04  0.04  0.78 
BayMeth (SssIfree)  0.08  0.05  0.79  
BayMeth (CNVunaware)  0.11  0.06  0.78  
BayMeth (SssIfree, CNVunaware)  0.05  0.05  0.79  
Batman  0.03  0.06  0.74  
MEDIPS  0.23  0.11  0.76  
BALM  0.29  0.16  0.78  
3  65,982  BayMeth  0.05  0.04  0.80 
BayMeth (SssIfree)  0.09  0.05  0.80  
BayMeth (CNVunaware)  0.01  0.04  0.80  
BayMeth (SssIfree, CNVunaware)  0.05  0.04  0.80  
Batman  0.11  0.06  0.77  
MEDIPS  0.19  0.09  0.76  
BALM  0.20  0.10  0.79  
4  256,074  BayMeth  0.05  0.04  0.81 
BayMeth (SssIfree)  0.10  0.05  0.81  
BayMeth (CNVunaware)  0.05  0.04  0.81  
BayMeth (SssIfree, CNVunaware)  0.11  0.06  0.81  
Batman  0.16  0.08  0.79  
MEDIPS  0.17  0.09  0.76  
BALM  0.12  0.07  0.80  
5  11,790  BayMeth  0.04  0.03  0.83 
BayMeth (SssIfree)  0.07  0.05  0.82  
BayMeth (CNVunaware)  0.09  0.04  0.83  
BayMeth (SssIfree, CNVunaware)  0.12  0.06  0.82  
Batman  0.18  0.08  0.80  
MEDIPS  0.12  0.07  0.80  
BALM  0.08  0.05  0.82 
Improved correlation across methylation kits on IMR90 DNA
Analysis of affinity capture methylation data without a matched SssI sample
Next, we applied (default) BayMeth to the MethylCap sequencing data of [32], provided at [43], and referred to as the ‘Bock’ data. Absolute read densities were available for (nonoverlapping) 50bp bins for four samples: HUES6 ES cell line, HUES8 ES cell line, colon tumor tissue and colon normal tissue (same donor as for the colon tumor tissue). There were no matched SssI samples available for these data. To take advantage of BayMeth in analyzing these data, we used a nonmatching SssI sample, but one chosen to be maximally compatible to the preparation conditions of the Bock data [32] (i.e. MethylCap with a low salt concentration of 200 mM NaCl). Regions from the data available were converted to hg19 coordinates using liftOver (see Additional file 3 for details). Although there were still slight differences in the preparation of the samples of interest and the SssI sample, which arise from the different read lengths (36 bp versus 75 bp, respectively) and read extensions (300 bp versus 150 bp, respectively) used before the read frequencies were calculated, we regard the SssI sample as a reasonably suitable control for running BayMeth. We analyzed all autosomes after removing bins that have no read depth in any of the four samples, leading to 42,955,764 bins. As in the previous analyses, we restricted our attention to bins that have at least 75% mappable bases, of which there were 37,013,409 or 86% of all bins. A detailed description of all data preparation steps and the data analysis using BayMeth based on the R package Repitools is given in Additional file 3. We compared the methylation estimates obtained using BayMeth with RRBS data available from the Bock study [32]. As in the methylation kit analysis, we masked unusual high counts in the derivation of the prior parameters as they sometimes cause problems in the numerical optimization routine; however, methylation estimates were derived for more than 99.5% of these masked bins. Interestingly, several high count regions could be explained by unannotated high copy number regions; see Pickrell et al. [42].
Discussion
DNAme plays a crucial role in various biological processes and is known to be aberrant in several human diseases, such as cancer. There are now a multitude of methylation profiling platforms, each with inherent advantages and disadvantages. Bisulfitebased approaches are considered the gold standard since they allow quantification at singlebase resolution. However, applied genomewide, this technique can be inefficient and expensive, in terms of CpGs covered per read or base sequenced [5, 16]. On the other hand, approaches based on affinity capture, such as MBD and MeDIP, combined with sequencing seem to provide a good compromise between cost and coverage, albeit at lower resolution. Thus, we consider MBDseq and its variants to be an attractive alternative and have developed an efficient data analytic approach to facilitate their use. In addition, MBDseq has recently been used with only hundreds of nanograms of starting DNA, thus making it applicable to a wider range of studies, such as clinical samples [44].
The key to our proposed method is the use of methylated DNA captured from a fully methylated SssI control. To facilitate accurate transformation of read counts into methylation, we recommend this sample is collected under the same conditions (e.g. same elution) used for the samples of interest. In our analyses, we used commercially available SssItreated DNA [26, 37] for the MBDseq experiments and used the 450k platform to verify that the overwhelming majority of CpG sites were indeed methylated (see Additional file 2: Figure S6). Similarly, such a sample can be constructed directly and inexpensively [45].
Our proposed method, BayMeth, is a flexible empirical Bayes approach, which transforms read densities into regional methylation estimates. Our model is based on a Poisson distribution and takes advantage of SssI control data in two ways: (i) we model SssI data jointly with data from a sample of interest to preserve the linearity of the methylation estimation and (ii) we explicitly get information about the regionspecific read density as a function of CpG density. Our method is similar in principle to MEDME, which was applied to fully methylated MeDIP microarray intensities [36]. However, our approach necessarily modifies the assumptions for count data (i.e. read densities versus probe intensities) and is effectively a compromise between the global fit that MEDME implements and a regionspecific correction. We showed that BayMeth performed better than stateoftheart techniques for MBDseq data, using multiple datasets where independent true methylation levels were available from WGBS or bisulfitebased methylation arrays. In general, MEDIPS and BALM underestimate the methylation levels and do not offer variability estimates. Batman performs reasonably well, but our results suggest that its variability estimates are generally underestimated and the method is very computationally demanding. Our model performed best in point estimation and was the only method to provide reasonable interval estimates. BayMeth uses analytic expressions for the posterior marginal distribution and the posterior mean and variance, avoiding computationally expensive sampling algorithms. Furthermore, we can explicitly integrate existing CNV data, which gives an improvement when applied to cancer datasets. CNV adjustments may be possible with existing approaches such as Batman or MEDIPS, based on ad hoc transformations of the read counts (e.g. see [30]), but are not included within the model formulation. In contrast, our model preserves the count nature of the data. To adjust the modeled mean for effects arising due to library composition or CNV, we introduced a normalization offset. This strategy is quite general and could be extended beyond composition and CNV (e.g. see [33, 46]).
A conceptually similar Bayesian hierarchical model, which uses MCMC sampling, has been proposed for methylseq experiments where methylation levels are derived based on enzymatic digestion using two enzymes [47]. A separate Poisson model is assumed for the tag counts of each enzyme. The models are linked through a shared parameter. One Poisson model contains a methylation level parameter μ, assumed to be uniformly distributed a priori. Our model may find application in this domain. In the applications presented here, a uniform prior distribution for the methylation level was observed to perform best when taking SssI information into account, while a mixed prior of a point mass at zero and at one, combined with a beta distribution, performed best when ignoring SssI information. The analytical expressions for the mean, variance and posterior marginal distribution are also available when using a mixture of beta distributions (see Materials and methods). Therefore, contextspecific information, such as CpG density or the position relative to transcriptional features, could be incorporated into the prior distribution for the methylation level. We have tried various weighted mixtures of two or three beta distributions that build in contextual information; however, these did not outperform the uniform prior when borrowing strength from the SssI sample. The reason probably lies in the fact that there is only one data point for each methylation parameter. Hence, when using an informative prior distribution for the methylation level, it is very difficult for the data to overcome this prior guess.
It is well known that methylation levels are dependent within neighboring regions. Thus, a potential improvement may involve modeling correlation between neighboring genomic bins. One approach might be Gaussian Markov random fields [48]; however, the analytical summaries are lost, so the gain in performance may not justify the more complex model and associated computational cost.
BayMeth may also be regarded as a preprocessing step in differential methylation analysis. The uncertainty in methylation estimates obtained by BayMeth could be propagated to a downstream analysis, which may lead to improved inferences on differential methylation.
Conclusions
BayMeth is an empirical Bayes approach that uses a fully methylated (SssI treated) control sample to transform observed read counts into regional methylation levels. BayMeth can be applied to methylated DNA affinity enrichment assays (e.g MBDseq, MeDIPseq) and improves on existing methods. Inefficient capture can readily be distinguished from low methylation levels by means of larger posterior variances. Furthermore, copy number variation data can be explicitly integrated, which offers improvement when applied to cancer datasets. Notably, BayMeth offers computationallyefficient analytic expressions for the mean and variance of the methylation level. A software implementation is freely available in the Bioconductor Repitools package.
Materials and methods
Methyl binding domain sequencing data for IMR90, LNCaP and SssI DNA
We used LNCaP and SssI MBDseq data and Affymetrix genotyping array data (LNCaP only) from Robinson et al. [37]. The data can be found at [49] under accession number [GEO:GSE24546]. Similarly, IMR90 MBDseq data are available from [GEO:GSE38679]. Details of DNA capture, preparation and sequencing can be found in Robinson et al. [26, 37].
Illumina HumanMethylation450 data for IMR90, LNCaP and SssI DNA
IMR90 and SssI DNA was processed by the Illumina HumanMethylation450 platform using the standard Illumina protocol. The raw (IDAT) and processed files are available at [49] under accession number [GEO:GSE54375]. The 450k array data for LNCaP cells originated from Robinson et al. [37] and can be found at [49] under accession number [GEO:GSE34340].
Methyl binding domain sequencing and methylated DNA immunoprecipitation sequencing for comparing data from different methylation kits
For comparing data obtained by different methylation kits, we captured methylated DNA from IMR90 and SssI DNA as follows. Genomic DNA was sheared to 150 to 200 bp using the Covaris S220 sonicator. MBDs were captured using the MethylMiner Methylated DNA Enrichment Kit (Invitrogen, Carlsbad, USA) and the MethylCap Kit (Diagenode, Liege, Belgium) following the manufacturers’ recommended protocols. The bound fractions were eluted at 500 mM and 1 M NaCl for MethylMiner and with buffers at different salt concentrations (low, medium, high) for MethylCap. Sequencing libraries were prepared with the SOLiD Fragment Library Construction Kit (Applied Biosystems, Foster City, USA). MeDIPseq methylation immunocapture and library preparation were performed using the MeDIP Kit (Active Motif, Carlsbad, USA) following the manufacturer’s recommended protocol.
Calculation of CpG density
CpG density is defined to be a weighted count of CpG sites in a predefined region. We used the function cpgDensityCalc provided by the Rpackage Repitools[50] to get binspecific CpG density estimates using a linear weighting function and a window size of 700 bp (since we expect fragments around 300 bp).
Calculation of mappability
Using Bowtie, all possible 36bp reads of the genome were mapped back against the hg18 reference, with no mismatches. At each base, a read can either unambiguously map or not. A mappability estimate gives the proportion of reads that can be mapped to a specific regions. To get binspecific mappability estimates, we used the function mappabilityCalc in the Repitools package [50]. In our analysis, a window of 500 bp was used (250 bp upstream and downstream from the center from each 100bp bin) and the percentage of mappable bases was computed. For the methylation kits analysis we used mappability estimates for hg19 provided by ENCODE on [51], from which we derived a weighted mean based on the window size. Analogously, we used [52] for the Bock data analysis.
Derivation of regionspecific methylation estimates from whole genome bisulfite sequencing
Here, $\sum _{j\in \mathcal{\mathcal{B}}}\left({r}_{j}^{+}+{r}_{j}^{}\right)$ is termed the depth.
Derivation of regionspecific methylation estimates from 450K arrays
First, the Illumina HumanMethylation450 methylation array was preprocessed using the default parameters of the minfi package [53], version 1.3.3. For each sample, a vector of beta values, one for each targeted CpG site representing methylation estimates, is produced. To obtain (100 bp) binspecific methylation profiles, we average beta values from all CpG sites within 100 bp (upstream and downstream; total window of 200 bp) from the center of our 100bp bins.
Derivation of regionspecific methylation estimates from reduced representation bisulfite sequencing data
That means using information for all CpG sites that fall into bin i.
Determining the normalizing offset
The composition of a library influences the resulting read densities [54]. For example, the SssI control is a more diverse set of DNA fragments since it captures the vast majority of CpGrich regions in the genome. Therefore, if the total sequencing depth were to be fixed, one would expect a relative undersampling of regions in SssI, compared to a sample of interest that is presumably largely unmethylated. To adjust the modeled mean (in the Poisson model) for these composition effects, we estimate a normalizing factor f that accounts simultaneously for overall sequencing depth and composition. Additional file 2: Figure S2 shows an M (logratio) versus A (averagelogcount) plot for 50,000 randomly chosen (100 bp) bins for IMR90 compared to the fully methylated control. A clear offset from zero is visible, since the distribution of M values is skewed in the negative direction. The normalization offset is estimated as $f={2}^{\text{median}\left({M}_{A>q}\right)}$, with q corresponding to a high quantile of A (here, 0.998; more than 35,000 points). In cancer samples where CNV is common, the normalization factor f is calculated from bins that originate from the most prominent copy number state (e.g., ccn = 4 in LNCaP cells).
Estimation of copy number
Copy numbers were estimated from Affymetrix SNP6.0 genotyping array data by PICNIC [55], using default parameters. PICNIC is an algorithm based on a hidden Markov model to produce absolute allelic copy number segmentation.
BayMeth methodology
 1.
An empirical Bayes procedure to derive sensible prior parameters for all parameters in the model.
 2.
The analytical derivation of the posterior marginal distribution, posterior expectation and variance for the methylation levels. Credible intervals are derived numerically from the posterior marginal distribution.
The details for both steps are provided in Additional file 1. In practice BayMeth can be used almost as a black box within the Bioconductor package Repitools[50].
Batman specification
Batman is an algorithm implemented in Java and run from the command prompt. The original Batman can be downloaded from [56]. We used an unreleased version, 20090617, received directly from Thomas Down, which had MeDIPseqspecific enhancements. The commands used to run Batman are given on the supplementary website [57].
MEDIPS specification
We used RBioconductor MEDIPS version 1.4.0. The detailed command sequence is given on the supplementary website [57]. MEDIPS returns methylation estimates in the range from zero to 1,000, which we rescaled to the interval [ 0,1]. In our comparison, we used the absolute methylation score provided by MEDIPS.
BALM specification
BALM is an algorithm implemented in C and C++ and run from the command prompt. The original BALM can be downloaded from [58]. We used version 1.01. The detailed command sequence is given on the supplementary website [57]. BALM returns a vector of methylation estimates, one for each targeted CpG site. To obtain (100 bp) binspecific methylation profiles, we averaged the methylation estimates from all CpG sites within 100 bp (upstream and downstream; total window of 200 bp) from the center of our 100bp bins. For the IMR90 dataset, BALM was run without an input control. To assess the effect of the missing input control, we ran BALM using a sample from a normal human prostate epithelial cell line (PrEC) as input control, which gave almost identical performance results.
Software
BayMeth is fully integrated into the R package Repitools and available from the Bioconductor project. Data (semiprocessed), R code for all figures and analyses are provided on [57].
Abbreviations
 bp:

base pair
 CNV:

copy number variation
 DNAme:

DNA methylation
 HPD:

highest posterior density
 LNCaP:

human prostate carcinoma
 MBD:

methyl binding domain
 MeDIP:

methylated DNA immunoprecipitation
 MSE:

mean of squared differences
 RRBS:

reduced representation bisulfite sequencing
 seq:

sequencing
 WGBS:

whole genome bisulfite sequencing.
Declarations
Acknowledgements
AR gratefully acknowledges funding from Forschungskredit and a grant from URPP (University Research Priority Program in Systems Biology/Functional Genomics) of the University of Zurich. MM acknowledges grant funding from the Swiss Cancer League (KFS02739022011). SJC gratefully acknowledges grant funding from NHMRC and NBCF. MDR acknowledges financial support from a SNSF project grant (143883) and from the European Commission through the 7th Framework Collaborative Project RADIANT (Grant Agreement Number: 305626). We thank Elena Zotenko, Marcel Coolen, Giancarlo Marra, Mattia Pelizzola, Mark van de Wiel and Leonhard Held for useful discussions on experimental, computational and statistical strategies.
Authors’ Affiliations
References
 Jones PA, Baylin SB: The epigenomics of cancer. Cell. 2007, 128: 683692. 10.1016/j.cell.2007.01.029.PubMedPubMed CentralView ArticleGoogle Scholar
 Slomko H, Heo HJ, Einstein FH: Minireview: epigenetics of obesity and diabetes in humans. Endocrinology. 2012, 153: 10251030. 10.1210/en.20111759.PubMedPubMed CentralView ArticleGoogle Scholar
 Clark SJ, Melki J: DNA methylation and gene silencing in cancer: which is the guilty party?. Oncogene. 2002, 21: 53805387. 10.1038/sj.onc.1205598.PubMedView ArticleGoogle Scholar
 Esteller M: Cancer epigenomics: DNA methylomes and histonemodification maps. Nat Rev Genet. 2007, 8: 286298. 10.1038/nrg2005.PubMedView ArticleGoogle Scholar
 Ziller MJ, Gu H, Müller F, Donaghey J, Tsai LTY, Kohlbacher O, De Jager PL, Rosen ED, Bennett DA, Bernstein BE, Gnirke A, Meissner A: Charting a dynamic DNA methylation landscape of the human genome. Nature. 2013, 500: 477481. 10.1038/nature12433.PubMedView ArticleGoogle Scholar
 Ruike Y, Imanaka Y, Sato F, Shimizu K, Tsujimoto G: Genomewide analysis of aberrant methylation in human breast cancer cells using methylDNA immunoprecipitation combined with highthroughput sequencing. BMC Genomics. 2010, 11: 13710.1186/1471216411137.PubMedPubMed CentralView ArticleGoogle Scholar
 Stein RA: Epigenetics – the link between infectious diseases and cancer. J Am Med Assoc. 2011, 305: 14841485. 10.1001/jama.2011.446.View ArticleGoogle Scholar
 Baylin SB, Jones PA: A decade of exploring the cancer epigenome – biological and translational implications. Nat Rev Cancer. 2011, 11: 726734. 10.1038/nrc3130.PubMedPubMed CentralView ArticleGoogle Scholar
 Jones PA: Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet. 2012, 13: 484492. 10.1038/nrg3230.PubMedView ArticleGoogle Scholar
 Gu H, Bock C, Mikkelsen TS, Jager N, Smith ZD, Tomazou E, Gnirke A, Lander ES, Meissner A: Genomescale DNA methylation mapping of clinical samples at singlenucleotide resolution. Nature Methods. 2010, 7: 133136. 10.1038/nmeth.1414.PubMedPubMed CentralView ArticleGoogle Scholar
 Laird PW: Principles and challenges of genomewide DNA methylation analysis. Nat Rev Genet. 2010, 11: 191203. 10.1038/nrg2732.PubMedView ArticleGoogle Scholar
 Lister R, Ecker JR: Finding the fifth base: genomewide sequencing of cytosine methylation. Genome Res. 2009, 19: 959966. 10.1101/gr.083451.108.PubMedPubMed CentralView ArticleGoogle Scholar
 Kerick M, Fischer A, Schweiger MR: Generation and analysis of genomewide DNA methylation maps. Bioinformatics for High Throughput Sequencing. Edited by: RodríguezEzpeleta N, Hackenberg M, Aransay AM. 2012, New York: Springer, 151167.View ArticleGoogle Scholar
 Clark SJ, Harrison J, Paul CL, Frommer M: High sensitivity mapping of methylated cytosines. Nucleic Acids Res. 1994, 22: 29902997. 10.1093/nar/22.15.2990.PubMedPubMed CentralView ArticleGoogle Scholar
 Bibikova M, Barnes B, Tsan C, Ho V, Klotzle B, Le JM, Delano D, Zhang L, Schroth GP, Gunderson KL, Fan JB, Shen R: High density DNA methylation array with single CpG site resolution. Genomics. 2011, 98: 288295. 10.1016/j.ygeno.2011.07.007.PubMedView ArticleGoogle Scholar
 Robinson MD, Statham AL, Speed TP, Clark SJ: Protocol matters: which metylome are you actually studying?. Epigenomics. 2010, 2: 587598. 10.2217/epi.10.36.PubMedPubMed CentralView ArticleGoogle Scholar
 Hansen KD, Timp W, Bravo HC, Sabunciyan S, Langmead B, McDonald OG, Wen B, Wu H, Liu Y, Diep D, Briem E, Zhang K, Irizarry RA, Feinberg AP: Increased methylation variation in epigenetic domains across cancer types. Nat Genet. 2011, 43: 768775. 10.1038/ng.865.PubMedPubMed CentralView ArticleGoogle Scholar
 Lee EJ, Pei L, Srivastava G, Joshi T, Kushwaha G, Choi JH, Robertson KD, Wang X, Colbourne JK, Zhang L, Schroth GP, Xu D, Zhang K, Shi H: Targeted bisulfite sequencing by solution hybrid selection and massively parallel sequencing. Nucleic Acids Res. 2011, 39: e12710.1093/nar/gkr598.PubMedPubMed CentralView ArticleGoogle Scholar
 Lee EJ, Luo J, Wilson JM, Shi H: Analyzing the cancer methylome through targeted bisulfite sequencing. Cancer Lett. 2013, 340: 171178. 10.1016/j.canlet.2012.10.040.PubMedView ArticleGoogle Scholar
 Clarke J, Wu HC, Jayasinghe L, Patel A, Reid S, Bayley H: Continuous base identification for singlemolecule nanopore DNA sequencing. Nat Nanotechnol. 2009, 4: 265270. 10.1038/nnano.2009.12.PubMedView ArticleGoogle Scholar
 Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach J, Turner SW: Direct detection of DNA methylation during singlemolecule, realtime sequencing. Nat Methods. 2010, 7: 461465. 10.1038/nmeth.1459.PubMedPubMed CentralView ArticleGoogle Scholar
 Down TA, Rakyan VK, Turner DJ, Flicek P, Li H, Kulesha E, Gräf S, Johnson N, Herrero J, Tomazou EM, Thorne NP, Bäckdahl L, Herberth M, Howe KL, Jackson DK, Miretti MM, Marioni JC, Birney E, Hubbard TJP, Durbin R, Tavaré S, Beck S: A Bayesian deconvolution strategy for immunoprecipitationbased DNA methylome analysis. Nat Biotechnol. 2008, 26: 779785. 10.1038/nbt1414.PubMedPubMed CentralView ArticleGoogle Scholar
 Aberg KA, McClay JL, Nerella S, Xie LY, Clark SL, Hudson AD, Bukszár J, Adkins D, Hultman CM, Sullivan PF, Magnusson PKE, van den Oord EJCG, Swedish Schizophrenia Consortium: MBDseq as a costeffective approach for methylomewide association studies: demonstration in 1500 casecontrol samples. Epigenomics. 2012, 4: 605621. 10.2217/epi.12.59.PubMedPubMed CentralView ArticleGoogle Scholar
 Clark C, Palta P, Joyce CJ, Scott C, Grundberg E, Deloukas P, Palotie A, Coffey AJ: A comparison of the whole genome approach of MeDIPseq to the targeted approach of the Infinium HumanMethylation450 BeadChip ^{Ⓡ} for methylome profiling. PLoS ONE. 2012, 7: e5023310.1371/journal.pone.0050233.PubMedPubMed CentralView ArticleGoogle Scholar
 De Meyer T, Mampaey E, Vlemmix M, Denil S, Trooskens G, Renard JP, De Keulenaer S, Dehan P, Menschaert G, Van Criekinge W: Quality evaluation of methyl binding domain based kits for enrichment DNAmethylation sequencing. PLoS ONE. 2013, 8: e5906810.1371/journal.pone.0059068.PubMedPubMed CentralView ArticleGoogle Scholar
 Nair SS, Coolen MW, Stirzaker C, Song JZ, Statham AL, Strbenac D, Robinson MD, Clark SJ: Comparison of methylDNA immunoprecipitation (MeDIP) and methylCpG binding domain (MBD) protein capture for genomewide DNA methylation analysis reveal CpG sequence coverage bias. Epigenetics. 2011, 6: 3444. 10.4161/epi.6.1.13313.PubMedView ArticleGoogle Scholar
 Serre D, Lee BH, Ting AH: MBDisolated genome sequencing provides a highthroughput and comprehensive survey of DNA methylation in the human genome. Nucleic Acids Res. 2010, 38: 391399. 10.1093/nar/gkp992.PubMedPubMed CentralView ArticleGoogle Scholar
 Chavez L, Jozefczuk J, Grimm C, Dietrich J, Timmermann B, Lehrach H, Herwig R, Adjaye J: Computational analysis of genomewide DNA methylation during the differentiation of human embryonic stem cells along the endodermal lineage. Genome Res. 2010, 20: 14411450. 10.1101/gr.110114.110.PubMedPubMed CentralView ArticleGoogle Scholar
 Lan X, Adams C, Landers M, Dudas M, Krissinger D, Marnellos G, Bonneville R, Xu M, Wang J, Huang THM, Meredith G, Jin VX: High resolution detection and analysis of CpG dinucleotides methylation using MBDseq technology. PLoS ONE. 2011, 6: e2222610.1371/journal.pone.0022226.PubMedPubMed CentralView ArticleGoogle Scholar
 Feber A, Wilson GA, Zhang L, Presneau N, Idowu B, Down TA, Rakyan VK, Noon LA, Lloyd AC, Stupka E, Schiza V, Teschendorff AE, Schroth GP, Flanagan A, Beck S: Comparative methylome analysis of benign and malignant peripheral nerve sheath tumors. Genome Res. 2011, 21: 515524. 10.1101/gr.109678.110.PubMedPubMed CentralView ArticleGoogle Scholar
 Stevens M, Cheng JB, Li D, Xie M, Hong C, Maire CL, Ligon KL, Hirst M, Marra MA, Costello JF, Wang T: Estimating absolute methylation levels at singleCpG resolution from methylation enrichment and restriction enzyme sequencing methods. Genome Res. 2013, 23: 15411553. 10.1101/gr.152231.112.PubMedPubMed CentralView ArticleGoogle Scholar
 Bock C, Tomazou EM, Brinkman A, Müller F, Simmer F, Gu H, Jäger N, Gnirke A, Stunnenberg HG, Meissner A: Genomewide mapping of DNA methylation: a quantitative technology comparison. Nature Biotechnology. 2010, 28: 11061114. 10.1038/nbt.1681.PubMedPubMed CentralView ArticleGoogle Scholar
 Robinson MD, Strbenac D, Stirzaker C, Statham AL, Song JZ, Speed TP, Clark SJ: Copynumberaware differential analysis of quantitative DNA sequencing data. Genome Res. 2012, 22: 24892496. 10.1101/gr.139055.112.PubMedPubMed CentralView ArticleGoogle Scholar
 Fader PS, Hardie BGS: A note on modelling underreported Poisson counts. J Appl Stat. 2000, 27: 953964. 10.1080/02664760050173283.View ArticleGoogle Scholar
 Abramowitz M, Stegun IA: Handbook of Mathematical Functions with Formulas, Graphs and Mathematical Tables. 1972, New York: Dover PublicationsGoogle Scholar
 Pelizzola M, Koga Y, Urban AE, Krauthammer M, Weissman S, Halaban R, Molinaro AM: MEDME: an experimental and analytical methodology for the estimation of DNA methylation levels based on microarray derived MeDIPenrichment. Genome Res. 2008, 18: 16521659. 10.1101/gr.080721.108.PubMedPubMed CentralView ArticleGoogle Scholar
 Robinson MD, Stirzaker C, Statham AL, Coolen MW, Song JZ, Nair SS, Strbenac D, Speed TP, Clark SJ: Evaluation of affinitybased genomewide DNA methylation data: effects of CpG density, amplification bias, and copy number variation. Genome Res. 2010, 20: 17191729. 10.1101/gr.110601.110.PubMedPubMed CentralView ArticleGoogle Scholar
 Schmittlein DC, Bemmaor AC, Morrison DG: Why does the NBD model work? Robustness in representing product purchases, brand purchases and imperfectly recorded purchases. Marketing Sci. 1985, 4: 255266. 10.1287/mksc.4.3.255.View ArticleGoogle Scholar
 Winkelmann R: Markov chain Monte Carlo analysis of underreported count data with an application to worker absenteeism. Empirical Economics. 1996, 21: 575587. 10.1007/BF01180702.View ArticleGoogle Scholar
 Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, TontiFilippini J, Nery JR, Lee L, Ye Z, Ngo QM, Edsall L, AntosiewiczBourget J, Stewart R, Ruotti V, Millar AH, Thomson JA, Ren B, Ecker JR: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009, 462: 315322. 10.1038/nature08514.PubMedPubMed CentralView ArticleGoogle Scholar
 Houseman EA, Christensen BC, Karagas MR, Wrensch MR, Nelson HH, Wiemels JL, Zheng S, Wiencke JK, Kelsey KT, Marsit CJ: Copy number variation has little impact on beadarraybased measures of DNA methylation. Bioinformatics. 2009, 25: 19992005. 10.1093/bioinformatics/btp364.PubMedPubMed CentralView ArticleGoogle Scholar
 Pickrell J, Gaffney D, Gilad Y, Pritchard J: False positive peaks in ChIPseq and other sequencingbased functional assays caused by unannotated high copy number regions. Bioinformatics. 2011, 27: 21442146. 10.1093/bioinformatics/btr354.PubMedPubMed CentralView ArticleGoogle Scholar
 DNA Methylation Technology Comparison, Supplementary Website. [http://www.broadinstitute.org/labs/meissner/mirror/papers/methbenchmark/index.html]
 Taiwo O, Wilson GA, Morris T, Seisenberger S, Reik W, Pearce D, Beck S, Butcher LM: Methylome analysis using MeDIPseq with low DNA concentrations. Nature Protocols. 2012, 7: 617636. 10.1038/nprot.2012.012.PubMedView ArticleGoogle Scholar
 Carvalho RH, Haberle V, Hou J, van Gent T, Thongjuea S, van Ijcken W, Kockx C, Brouwer R, Rijkers E, Sieuwerts A, Foekens J, van Vroonhoven M, Aerts J, Grosveld F, Lenhard B, Philipsen S: Genomewide DNA methylation profiling of nonsmall cell lung carcinomas. Epigenetics Chromatin. 2012, 5: 910.1186/1756893559.PubMedPubMed CentralView ArticleGoogle Scholar
 Hansen KD, Irizarry RA, Wu Z: Removing technical variability in RNAseq data using conditional quantile normalization. Biostatistics. 2012, 13: 204216. 10.1093/biostatistics/kxr054.PubMedPubMed CentralView ArticleGoogle Scholar
 Wu G, Yi N, Absher D, Zhi D: Statistical quantification of methylation levels by nextgeneration sequencing. PLoS ONE. 2011, 6: e2103410.1371/journal.pone.0021034.PubMedPubMed CentralView ArticleGoogle Scholar
 Rue H, Held L: Gaussian Markov Random Fields: Theory and Applications. 2005, London: Chapman & Hall/CRC PressView ArticleGoogle Scholar
 Gene Expression Omnibus. [http://www.ncbi.nlm.nih.gov/geo]
 Statham AL, Strbenac D, Coolen MW, Stirzaker C, Clark SJ, Robinson MD: Repitools: an R package for the analysis of enrichmentbased epigenomic data. Bioinformatics. 2010, 26: 16621663. 10.1093/bioinformatics/btq247.PubMedPubMed CentralView ArticleGoogle Scholar
 Mapability of Reference Genome from ENCODE (window size 100mer), direct link to bigWig file. [http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign100mer.bigWig]
 Mapability of Reference Genome from ENCODE (window size 50mer), direct link to bigWig file. [http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign50mer.bigWig]
 Aryee MJ, Jaffe AE, CorradaBravo H, LaddAcosta C, Feinberg AP, Hansen KD, Irizarry RA: Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA Methylation microarrays. Bioinformatics. 2014, doi:10.1093/bioinformatics/btu049Google Scholar
 Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNAseq data. Genome Biology. 2010, 11: R2510.1186/gb2010113r25.PubMedPubMed CentralView ArticleGoogle Scholar
 Greenman CD, Bignell G, Butler A, Edkins S, Hinton J, Beare D, Swamy S, Santarius T, Chen L, Widaa S, Futreal PA, Stratton MR: PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics. 2010, 11: 164175. 10.1093/biostatistics/kxp045.PubMedPubMed CentralView ArticleGoogle Scholar
 Bayesian tool for methylation analysis (BATMAN) on Wikipedia, the free encyclopedia. [http://en.wikipedia.org/wiki/Bayesian_tool_for_methylation_analysis]
 BayMeth: improved DNA methylation quantification for affinity capture sequencing data using a flexible Bayesian approach. [http://imlspenticton.uzh.ch/robinson_lab/BayMeth/index.html]
 BALM. [http://motif.bmi.ohiostate.edu/BALM]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.