Mouse epiblast-like P19 embryonal carcinoma cells were treated with retinoic acid (RA) for 48 h to induce their differentiation into neural progenitor-like cells (NPLCs) [5]. Genomic DNA was then fragmented by sonication and 5hmCs were glucosylated in vitro using β-glucosyltranferase and azide-glucose (5gmC, Fig. 1a). Azide then reacted with a biotin conjugate allowing immobilization of the modified DNA (biot-5gmC, Fig. 1a) on streptavidin-coated magnetic beads. After end-polishing and adapter ligation as previously described [19], captured DNA was then treated on beads with 5’-3’ exonuclease. After elution from the beads, samples were processed for subsequent library preparation and Illumina sequencing. Applying SCL-exo to a hydroxymethylated DNA standard (Fig. 1b) revealed that, as expected, a large fraction of sequencing reads started with a C (i.e. 36 % for the forward strand and 38 % for the reverse strand, Fig. 1c). In addition, the number of reads covering each base within the DNA standard peaked at the first hydroxymethylated Cs of both strands, indicating exonuclease stalling at bead-bound biot-5gmCs (Fig. 1d–f). It is of note that not all DNA strands were digested by the exonuclease up to the first 5hmC since unmodified Cs were found within reads (Fig. 1e and f). In addition, conversion of 5hmC to biot-5gmC is likely to be incomplete since the exonuclease did not stall systematically at the first modified C (a fraction of the reads were covering sequences located more than 40 bases away from the first hydroxymethylated cytosine of the standard, Fig. 1d and e). Analyzing the number of reads covering bases upstream (up to position 19) of the first hydroxymethylated cytosine (position 29) of the DNA standard suggested that the exonuclease did not digest efficiently the 5’ end of the standard in 12.38 % of the cases. Similarly, the rate of lack of exonuclease stalling, probably due to a lack of glycosylation/biotinylation and/or binding to beads, could be inferred from the number of reads starting after the first hydroxymethylated cytosine and was found to be 51.04 %. Accordingly, the probability of not identifying a 5hmC in a replicate of SCL-exo is: 0.1238 + 0.5104 = 0.6342. However, when addressing CpG hydroxymethylation, taking into account information from both strands leads to a probability of not identifying a 5hmCpG of 0.63422 (0.4022). In the case of two replicates, the probability to identify a 5hmCpGs is thus (1–0.40222) × 100 = 83.82 % and raises to 93.49 % when running three replicates. Hence, it is crucial to run several SCL-exo replicates in order to improve 5hmCpGs identification.
In the context of hydroxymethylated CpGs in NPLC genomic DNA, the exonuclease stalled on average 3 bp, and in most cases within a window of 10 bp, upstream of bead-bound biot-5gmCs (Fig. 2a). Reads obtained by single-end HiSeq sequencing of three technical replicates were next mapped to the mouse reference genome and processed as follows to generate SCL-exo signal files: reads including a single CpG within a 10-bp window starting at the 5’ end (42 % of all mapped reads, Additional file 1a) were selected to build single-CpG resolution.wig files whose positions identify hydroxymethylated CpGs (id CpGs) and whose signal represents the coverage of that position (i.e. the sum of the reads from both strands at each given CpG). To avoid calling ambiguities, reads containing two or more CpGs in the first 10 bases (4.7 % of all mapped reads, “probable CpGs” in Additional file 1a) were not retained for analysis. Visualization by the Integrated Genome Browser (IGB) software showed that SCL-exo signal at id CpGs correlated well with hMeDIP-seq peaks (Fig. 2b). Similarly, reads containing CpHs and no CpGs in the first 10 bases (47 % of all mapped reads, Additional file 1a) were used to build a.wig file putatively reflecting non-CpG hydroxymethylation. IGB visualization indicated that the putative CpH hydroxymethylation signal was distributed quite uniformly with no detectable peaks corresponding to hMeDIP peaks (Additional file 1b). In addition, no correlation (Pearson’s correlation coefficient [r] = 0.04) was detected between possible CpH signals from two technical replicates (Additional file 1c). Collectively, these data strongly suggest that, as shown for mouse ES cells [11], most hydroxymethylated cytosines are found in a CpG context in RA-treated P19 cells and that the occurrence of SCL-exo reads not containing CpGs in their first 10 bases was probably due to a suboptimal processing by the exonuclease.
In order to evaluate the reproducibility of the SCL-exo procedure Pearson’s correlation coefficient was determined for id CpG.wig files from two technical replicates of SCL-exo (Fig. 2c). Signals from SCL-exo id CpG replicates showed a high correlation (r = 0.72), indicating that SCL-exo is suited for a reproducible identification of hydroxymethylated CpGs. Notably, non-overlapping SCL-exo id CpGs between two replicates had a lower coverage than overlapping id CpGs (Additional file 1d). Hence, increasing sequencing depth might enhance the reproducibility of the method. Considering that the mean signal of overlapping id CpGs was 1.6-fold higher than the mean signal of non-overlapping id CpGs from two replicates with 48 million reads, increasing sequencing depth up to 1.6 × 48 million reads (≈80 million reads) per replicate could allow higher confidence in the identification of hydroxymethylated CpGs. Finally, SCL-exo id CpG signal showed a fairly good correlation with hMeDIP (r = 0.51, Fig. 2d) and 57.02 % of the SCL-exo id CpGs with at least 20× coverage were included in hMeDIP peaks (Fig. 2e). As a possible readout of exonuclase undigested DNA fragments, we selected unique CpGs contained in the first 10 bases of reads obtained by SCL-seq without exonuclease digestion to build a SCL-seq id CpG.wig file. This SCL-seq id CpG signal did not correlate with hMeDIP (r = 0.07, Fig. 2f) and SCL-exo (r = 0.01, Fig. 2g). The mean signal (number of reads) at SCL-seq id CpGs was 2.03 and could thus be considered as a threshold for false identification of hydroxymethylated CpGs. Hence, for the subsequent analysis, only CpGs identified in at least two out of three SCL-exo replicates (consensus id CpGs) at a threshold arbitrarily set to 8× coverage were considered (178,218 id CpGs). The status of hydroxymethylation of 27 selected CpGs from this set (consensus id CpGs) and included in a MspI CCGG restriction site was next verified by using the EpiMark 5-hmC and 5-mC Analysis Kit (New England Biolabs) which allows a quantitative determination of the percentage of hydroxymethylation of CpGs thanks to the insensitivity of glucosylated ChmCGG sites to MspI cleavage. A strong correlation (r = 0.79) between EpiMark and SCL-exo data was observed (Fig. 2h and Additional file 1e). This is in the range of what has been observed when Aba-seq and EpiMark data were compared (r = 0.72) for hydroxymethylated CpGs in ES cells [16]. Interestingly, the hydroxymethylated status of SCL-exo id CpGs not included in hMeDIP peaks was systematically validated with the EpiMark kit, thus indicating that those were not false positive (Additional file 1e). Comparison between the two techniques suggested however that below a threshold of 8 % of hydroxymethylation (as assessed by EpiMark), CpGs were not efficiently identified by SCL-exo (Fig. 2h and Additional file 1e). Here again, increasing sequencing depth might increase the identification rate of these poorly hydroxymethylated CpGs. This correlation study allowed us to estimate that CpGs showing 20 % hydroxymethylation by EpiMark should have an approximate 16× coverage by SCL-exo. Using this threshold of coverage, the calculated overlap between id CpGs from two replicates of SCL-exo was 53.6 %. The genome-wide distribution of SCL-exo id CpGs and probable CpGs was next interrogated with the CEAS annotation tool [20]. As already described for 5hmC-enriched regions from P19 cells recovered by immunoprecipitation [5], SCL-exo id CpGs were particularly enriched in introns (p = 9.7e−256) and promoters (p = 1.1e−49), although exons might be slightly under-represented due to the fact that “probable CpGs,” which are found in exons for 6.7 % of them, were not included in the analysis (Additional file 1f). In addition, inclusion of id CpGs in enhancers (H3K4me1 positive regions), either active (positive for H3K27ac) or primed (negative for H3K27ac), was proportional to the depth of coverage, suggesting that SCL-exo-identified CpGs with high coverage are likely to be included in functional enhancers.
Since SCL-exo identified hydroxymethylated CpGs in genomic regions not enriched by immunoprecipitation (hMeDIP), a technique which efficacy is known to depend on CpG density [10], the relationship between CpG abundance in regions containing SCL-exo id CpGs and the associated SCL-exo coverage was next investigated and compared to hMeDIP and TET1 ChIP-seq signals from RA-treated P19 cells (GSM941665 and GSM941681 respectively). Although TET1 enrichment was strongly correlated to CpG density (Fig. 3a), such a correlation was not observed for the SCL-exo signal at id CpGs (Fig. 3b–d). Indeed, id CpGs in regions with a unique CpG in 300 bp had a fairly high mean coverage (19.70 ± 0.1), the maximum mean id CpG coverage (reached for regions with 9 CpGs in 300 bp) being only 1.11-fold higher (21.91 ± 0.26 - Fig. 3d). Conversely, hMeDIP-seq signal was clearly dependent on CpG density and regions with a unique hydroxymethylated CpG were not immunoprecipitated (Fig. 3e). Accordingly, inclusion of SCL-exo id CpGs in hMeDIP peaks increased as a function of CpG density (Fig. 3f). These data indicate that SCL-exo is more sensitive than hMeDIP for detecting hydroxymethylated CpGs at low density.
We next implemented SCL-exo with genomic DNA from E14 mouse ES cells (see sequencing statistics in Additional file 2a) and compared the data with previously published TAB-seq or Aba-seq datasets in E14 ES cells [16, 21]. Surprisingly, hydroxymethylated CpGs were not consistently identified within the three technical replicates of SCL-exo (Additional file 2b) and only 13.97 % of id CpGs with at least 20× coverage overlapped between two replicates (Fig. 4a), whereas their SCL-exo signals showed a rather poor correlation (r = 0.18, Fig. 4b). Most notably, a similar lack of reproducibility was observed for high confidence (≥20 % hydroxymethylation) CpGs identified by TAB-seq (Fig. 4a and b, right panels). However, running the SCL-exo CpG identification algorithm on the E14 input-seq reads showed that the number of false positive CpGs identified within the Input-seq dropped abruptly when coverage increased (Additional file 2c) whereas the number of SCL-exo id CpGs remained quite stable within the same range of coverage, indicating that SCL-exo id CpGs in E14 cells are most likely to be truely hydroxymethylated. Collectively, these data suggest that the E14 mESC hydroxymethylome might be extremely variable from cell to cell, a conclusion which is in accordance with the known variability of the methylome of ES cells [22–24]. Despite this variability, a significant fraction of the CpGs identified either by SCL-exo or by TAB-seq overlapped, especially at high coverage (Fig. 4c). This was also observed when comparing SCL-exo and Aba-seq data (Fig. 4c). Importantly, as noticed for RA-treated P19 cells, a large fraction of SCL-exo id CpGs fell within hMeDIP-seq peaks (Fig. 4d, left panel). This was also true for TAB-seq id CpGs (Fig. 4d, right panel), but variability was still observed between TAB-seq and SCL-exo id CpGs overlapping with hMeDIP peaks (Additional file 2d). These data indicate that, although cell to cell heterogeneity in E14 ESCs might hinder reproducible identification of unique 5hmCpGs, SCL-exo and TAB-seq identify similar regions as being hydroxymethylated in E14 ESCs. Accordingly, functional annotation of the SCL-exo and TAB-seq id CpGs overlapping hMeDIP peaks in E14 cells using GREAT (http://bejerano.stanford.edu/great/) generated similar terms for both sets of CpGs (Additional file 2e). However, analysis of the bulk id CpGs from both techniques (i.e. without selecting those overlapping with hMeDIP peaks) using standard settings of the GREAT annotation tool retrieved functional annotation terms only for SCL-exo id CpGs (Fig. 4e).
Aside from generating information on the location of 5hmC-enriched regions in the genome, single-CpG resolution of SCL-exo allows to interrogate databases for particular TFBS motif enrichment with high precision. To this aim, sequences including hydroxymethylated CpG positions in P19-derived NPLCs were searched for motifs with the SeqPos motif tool from Cistrome (http://cistrome.dfci.harvard.edu, [25]). Retrieved motifs included the CpG-containing E box (N-Myc), E2F, ATF6, and EGR motifs with the highest probability (p = 1e−30, Fig. 5a). These particular motifs were also found to be enriched (p = 1e−30) in a set of CpG-containing sequences picked at random in the genome (Fig. 5a). Nonetheless, Z-scores calculated by SeqPos clearly indicated a specific enrichment for the CGTG-containing E-box, ATF, and EGR motifs in SCL-exo identified regions versus random CpG regions (Fig. 5a). These data suggest that TET targeting is biased towards CpGs included in a CGTG motif and, as a consequence, that DNA methylation/demethylation could regulate the activity of E-box, EGR and ATF motif-containing regions on a wide scale in vivo, as already suggested for unique regions [26–28]. However, according to their Z-score, the de novo motifs ACGTG and CACGT ranked before known TFBSs (Fig. 5a). These two motifs shared the ACGT sequence which was previously shown to be preferentially methylated by the DNA methyltransferases 3a and 3b (DNMT3s) together with other RCGY (R = A or G, and Y = C or T) motifs compared to YCGR motifs [29]. Accordingly, a higher incidence of CpG hydroxymethylation at RCGY motifs compared to YCGR motifs was evidenced through the analysis of motif densities around SCL-exo id CpGs (Fig. 5b and c). Since TETs might not have sequence selectivity as suggested by the structure of TET2 complexed with DNA [30], the detected bias in TET targeting towards RCGY motifs might actually reflect a preferential targeting of these motifs by DNMT3s in P19 cells.
Cytosine hydroxymethylation has been associated with the activity of enhancers, as evidenced for those bound by the homeodomain transcription factor Meis1 [5]. Enhancers are genomic regions enriched in multiple TFBSs which are characterized by their conservation across species [31]. In order to further investigate a potential relationship between hydroxymethylated CpGs and transcription factor binding, the conservation status of SCL-exo id CpGs from a subset of enhancers encompassing ChIP-seq-identified Meis1-bound TGACAG binding sites in NPLCs was examined. Although id CpGs were included in highly conserved genomic regions (Fig. 6a, left panel), these CpGs were themselves poorly conserved (Fig. 6a, close-up view, right panel). As an example, Fig. 6b shows that, although embedded in a highly conserved region, the observed SCL-exo id CpG is present only in the mouse genome. The lack of conservation of hydroxymethylated CpGs could also be inferred from TAB-seq data in E14 ES cells (Additional file 3a), indicating that it is not an artifact due to the SCL-exo method and, most importantly, that it is also true for ES cells. Conversely, Meis1-bound TGACAG motifs in P19 cells showed high conservation, as did c-Myc- and N-Myc-bound CACGTG motifs in ES cells, whereas SCL-exo id CpGs overlapping with CACGTG sites showed a lack of conservation (Additional file 3b–e). These data indicate that hydroxymethylated CpGs are not conserved among vertebrates and highlight the possibility that transcription factor-bound CpGs are protected from loss during evolution by a lack or a high turnover of cytosine modifications. In support of this hypothesis, c-Myc- and N-Myc-bound CACGTG motifs in ES cells were not found to be enriched in 5hmC by TAB-seq (Additional file 3f), and only 14 N-Myc-bound motifs (out of 313) were identified by SCL-exo (Additional file 3g). Of note, these 14 motifs showed a drop in conservation at the level of the CpG dinucleotide (Additional file 3h). Next, SCL-exo id CpGs from NPLCs were clustered into three groups according to their level of conservation (Fig. 6c). As expected, a majority of id CpGs showed extremely low to no conservation (i.e. 68.5 % had a PhastCons score below 0.075). These clusters were analyzed for motif enrichment (Fig. 6d), genomic distribution (Additional file 3i), SCL-exo signal (Additional file 3j), and H3K4 monomethylation (Additional file 3k), and results showed that regions containing hydroxymethylated CpGs share similar characteristics, independently of their CpG conservation status. Hence, these results suggest that conservation during evolution does not represent a driving force for CpG targeting by TETs.