Processive and stochastic CHH methylation by plant DRM2 and CMT2 revealed by single-read methylome analysis

Cytosine methylome data is commonly generated through next-generation sequencing (NGS). Analyses of this data average methylation states of individual reads. We propose an alternate method of analysing single-read methylome data. Using this method, we identified patterns that relate to the mechanism of two plant non-CG methylating enzymes, DRM2 and CMT2: DRM2 has higher processivity than CMT2, and DRM2-methylated regions have higher variation among cells. Based on these patterns, we developed a classifier that predicts enzyme activity in different species and tissues. To facilitate further single-read analyses, we developed a genome browser optimised for visualising and analysing NGS data at single-read resolution.

The relationship between changes in DNA methylation patterns and gene expression is not trivial, as it involves a non-linear, additive effect of multiple methylation contexts, along with the effect of additional levels of epigenetic regulation, including chromatin structure, histone position and modifications [1,17,18]. Additionally, the most common method for studying DNA methylation (bisulfite sequencing, or BS-seq), does not provide information on the methylation states of individual cells. BS-seq involves a chemical reaction that converts unmethylated cytosines into uracil, which are subsequently read as thymine when sequenced [19]. Sequencing data produced by BS-seq consists of short DNA fragments originating from a random subset of cells in the sample tissue; cytosines that have not been converted to thymine are assumed to represent methylated cytosines in the source genome [20,21]. Hypothetically, each read relates to the methylation state of a single cell in the sample, and so the collection of reads will reflect methylation heterogeneity present in the sample tissue. However, BS-seq methylome data is most commonly averaged among reads overlapping the same region. Thus, the output signal of BS-seq analysis pipelines combine populations that may have fundamentally different methylation levels.
While alternative methods exist for generating methylomes that address this issue, namely single-cell BS-seq, these methods are currently not feasible for all organisms and tissues [22][23][24][25]. As a consequence, most currently available methylome data is not singlecell; analyses that can decode additional dimensions of information from this type of data are of high potential value in producing more insights from new and existing data. One such analysis was recently proposed to produce a heterogeneity signal from CG methylation patterns among cells [26]. This tool calculates the Shannon entropy of reads overlapping a set of CG sites, and identifies unique patterns of methylation within this subsample, as relating to heterogeneity within the sample population of cells.
We were interested in extracting additional information from BS-seq data, specifically relating to CHH methylation, which could identify enzyme-specific patterns of methylation. To this end, we designed a single-read analysis pipeline that extrapolates multiple dimensions of methylation variation, using NGS reads from either a single-region (collection of CHH sites) or from functionally-similar sets of regions. This analysis revealed that DRM2 and CMT2 have distinctive methylation patterns at both single-cell and population levels. DRM2-methylated reads show higher processivity, whereas CMT2-methylated reads and regions are more stochastically methylated. These findings shed new light on the distinct mechanisms of CHH methylating enzymes and can be utilized for prediction purposes. By characterizing these patterns in Arabidopsis thaliana mutants of these enzymes, we developed a classifier that can predict the identity of the enzyme that methylates a particular region. Importantly, the classifier does not rely on a comparison to mutants of the same species or tissue. At a genome level, it can predict the presence or absence of DRM2-like or CMT2-like activity. After validating the classifier, we used it to predict null DRM2 CHH methylation activity and to associate CMT2 methylation pattern to that of DNMT3 methylation signal in the early land plant species and human cells.
Our analyses use BS-seq data at single-read resolution. To facilitate further analyses, we developed a genome browser, Single-Read Browser (SRBrowse), that is optimised for visualizing and analysing NGS data at single-read resolution. The tool, which has a single user interface for browsing and analyses, can directly process local NGS data or NCBI accessions into an optimised format for display in the genome browser.

Results
Designing a single-read analysis pipeline for CHH-methylation Single-read analyses can be usefully applied to any NGS data where short reads vary in a way that reflects a biological signal. With BS-seq data, the importance of this analysis is that methylation varies between cells meaningfully, and each read hypothetically reflects the state of a single cell. We chose to focus on CHH methylation for a number of reasons: (i) CHH sites are 2-3 times more common than CG/CHG sites. Given that site density dictates the amount of information that can be deduced from a single read, contexts with a higher density allow more data to be retrieve from individual samples of varying coverage; (ii) as opposed to CG sites the methylation of which is usually binary, CHH sites are mostly partially methylated [20,21,27]. Sites that are either unmethylated or fully methylated have low or zero variation among reads ;(iii) CHH methylation is known to vary between tissues [28]; in itself, the fact that CHH sites are partially methylated suggests that most CHH sites are differentially methylated between cells of the sample tissue [20,21]; (iv) CHH sites are methylated by two types of DNMTs, DRM and CMT, which their activity is regulated by distinct molecular mechanisms, RdDM and DDM1-dependent respectively [13,14]. Thus, focusing on CHH sites might expose the potential variation between regions of the same sample due to the different mechanisms.
There are a number of factors that limit the maximal region size, mainly: (1) the average read length and coverage of the sequencing library; and (2) the frequency of the specific methylated context. The expected number of reads per region for a given library can be calculated as: expected reads per region = coverage · (1 − region size read length ) This relationship is illustrated in Figure S1a. The frequency of the methylation context is also important to consider, as selecting regions rich in a particular context can limit analyses to a small subset of the data and thus bias the results of the pipeline (Fig. S1b). Different types of analyses can utilise different filter options. For all analyses except where noted otherwise, regions were selected with 5 CHH sites, up to 30bp length. In a wild type A. thaliana sample [14], this includes 58% of regions from TEs containing 5 CHH sites with >= 5% methylation (Fig. S1c).
In order to study individual-and population-level variation in methylation, the pipeline segments the genome into short regions of a limited length of similar functional elements or annotations, e.g. TEs, genes, exons, histone marks, chromatin structure, etc. Due to the asymmetry of CHH sites, these regions are defined per strand. For instance, a short region could span 30 base pairs and have 5 CHH sites (Fig. 1a). Reads overlapping with all 5 CHH sites are then scored according to the methylation state of these sites, within the read (Fig. 1a). We defined three features of methylation variation: (1) the standard deviation of CHH site methylation; (2) the standard deviation of read methylation; and (3) stochasticity (Fig. 1b). These features relate to functional differences between methylation patterns: (1) higher variation among sites can reflect fluctuations in the methylation signal and/or CHH subcontext specificity of the enzyme; (2) higher variation among reads can reflect differential regulation of methylation among cells composing the sample; and (3) higher stochasticity can reflect lower processivity in the mechanism of the methylating enzyme, and/or subcontext specificity. a b Figure 1: Pipeline for analysing variation of single-read data regarding CHH methylated regions. (a) Schematic representation of pipeline: regions of up to a set span of base pairs with set number of CHH sites are selected; reads overlapping these sites are then quantified by the number of methylated C sites. (b) Schematic representation of different types of variation within methylated regions: (i) regional variation, relating to differences in methylation average between adjacent CHH sites; (ii) read methylation variation, relating to differences in methylation level among reads overlapping the same region, taken to represent the sample population of cells; (iii) within-cell variation, relating to differences between adjacent CHH sites within each read.
Read-level DRM2 CHH methylation activity is more processive than that of CMT2 To compare CMT2 and DRM2 methylation patterns, we used BS-seq data of mutants defective in the respective enzymes [14]. The mutants included cmt2 (three mutated alleles), and drd1 and drm2 (which both affect DRM2 activity [14,29]). The assumption was that, given the complementary activity of these enzymes, the remaining methylation in cmt2 should consist of DRM2-methylated reads, whereas methylation in drd1 or drm2 should consist of CMT2-methylated reads [13,14]. As the methylation pattern in the three cmt2 mutants was similar, we combined the methylation data of these samples.
Initially we tested whether the methylation level of individual reads differs between the mutants on a genomic scale. To do this, reads were collected from regions conforming to the standard parameters of our analysis (30bp max region span, 5 CHH sites). Reads were scored based on the methylation of the CHH sites from 0 to 5, removing reads that were not methylated at any site. Figure 2a shows the distribution of read methylation level resulting from the above analysis in each of the mutants, selected from regions methylated at three different minimal levels (5%, 20% and 40%). Methylated reads from drd1 and drm2 show a different distribution from the combined cmt2 data: while cmt2 retains a proportion of fully methylated reads, drd1 and drm2 retain mainly partially methylated reads, with a low proportion of fully methylated reads. This pattern is present even in regions with higher average methylation (Fig. 2a, rightmost panel). The difference between lowly-and highly methylated drd1 or drm2 regions is explained exclusively by the methylation state of partially methylated reads ( Fig. S2a-b). In comparison, the proportion of partially methylated reads between low-and highly-methylated regions in cmt2 is similar, while methylation is correlated to the proportion of fully methylated reads (Fig. S2a). Due to the relatively low coverage of the drm2 mutant, we used drd1 for the subsequent analyses, but validated the patterns identified in drd1 using the drm2 mutant and other drm2 mutants. Figure 2a demonstrates that reads from drd1 and cmt2 have different methylation levels. An additional dimension of variation among reads is the processivity of methylation. This is defined as the distribution of methylation within reads, and quantified by counting the number of changes in methylation within the read (e.g. a methylated CHH site adjacent to an unmethylated CHH site on the same strand) out of the total possible number of changes (illustrated in Figure 1b). Figure 2b demonstrates the separation between reads from the respective mutants according to their methylation level and processivity: while drd1 has reads with lower methylation and lower processivity, cmt2 has reads with higher methylation and higher processivity (i.e. low stochasticity). This correlation persist in regions with different CHH content, i.e. 5-15 sites per region (Fig. 2b), suggesting that this pattern does not depend on the density of CHH sites. This result is also consistent for the mutant alleles composing the cmt2 sample (i.e. cmt2-4 cmt2-5 and cmt2-6) and drm2 (Fig. S2c). Overall, these results suggest that DRM2 CHH methylation activity is more processive than that of CMT2.
distinct patterns of CHH methylation in the DNMT mutants could result from demethylase activity. To test this hypothetical scenario, we analysed three different demethylase mutants: single mutants repressor of silencing (ros1-4), demeter (dme-2) and the triple mutant ros1-3, demeter-like protein 2 (dml2-1) and dml3-1 (rdd). Regions methylated in either drd1 or cmt2 were analysed as representing regions methylated by the complementary enzyme; the distribution of read methylation at these regions for each of the wild type/demethylase mutant pairs were plotted. Figure 2c-d summarises this comparison. Figure 2c presents data from CMT2-methylated regions (methylated in drd1 ), while Figure 2d presents reads from DRM2-methylated regions (methylated in cmt2 ). For dme, drm2 and cmt2 mutants of vegetative nucleus tissue from [33] were used. For each demethylase, regions were also selected according to hypermethylation (> 10% increase in methylation) relative to the respective wild type sample. The enzyme-specific pattern is present in the demethylase mutants and its respective wild type sample (Fig. 2c-d): in CMT2 regions, both the mutant and wild type have a low proportion of fully methylated reads, with hypermethylation in the mutant correlating to the increase in partially methylated reads ( Fig. S3a-c, left panels); in DRM2 regions, both the mutant and wild type have fully methylated reads, with hypermethylation in the mutant correlating to the increase in fully methylated reads ( Fig. S3a-c, right panels). This suggests that the enzyme-specific pattern is present prior to demethylase activity; it also suggests that demethylases reduce methylation entirely in specific cells (given the decrease in zero-methylated reads in demethylase mutants).

Variation of CHH methylation among sites and reads distinguishes between CMT2 and DRM2 target regions
Analyses of individual reads can reflect the mechanism of different CHH-methylating enzymes, but the predictive confidence in distinguishing between methylated reads is limited, given that most reads are lowly methylated with a limited range of processivity (Fig. S2c). Hence, to characterise region methylation, we used methylation features that extract information from single-read data (Fig. 1b). The separation of all methylated regions (>= 10% average methylation) from the mutants is shown in Figure 3a-c. Paired with the distributions of features of the actual data (solid lines) are features of read datasets generated using a poisson model of single-C sites, where the chance of methylation per-site, per-read is equal to the mean region methylation (dashed lines). The similarity between the actual and generated data can demonstrate the degree to which the feature distributions of each mutant are explained by stochastic variation at the read level.
CMT2-methylated regions have higher variation among sites, lower variation among reads, and lower average processivity (as suggested by Figure 2b). The read methylation variation and processivity of CMT2-methylated regions overlap with the distributions of generated data (Fig. 3a-c), suggesting that, in the drd1 sample, variation of CMT2mediated CHH-methylation activity among reads is mainly stochastic. By contrast, DRM2methylated regions have low variation among sites, higher variation among reads, and higher average processivity (Fig. 3a-c). Variation among reads in DRM2-methylated regions is not stochastic, suggesting that DRM2-mediated CHH methylation in these regions is differentially regulated. Of these factors, variation among reads best predicts the methylating enzyme of the region (Table 1).
To understand the relationship between these methylation features in the mutants, features were analysed at the level of whole functional elements (in this case, TEs), by averaging Among-site sd.
Read methylation sd.   (Fig. 3d) and in specific TE superfamilies (Fig.  S4a). In the drd1 mutant these features are correlated with a slope of 1, with among-site variation increasing linearly with read methylation variation (Fig. 3d). On the other hand, in the cmt2 mutant the two features are correlated with a smaller slope (0.21), with most TEs having a low average among-site variation (Fig. 3d). This was consistent in each of the TE superfamilies ( Fig S4a). CMT2 shows specificity for the CHH subcontext CWA (i.e. CTA or CAA) [20,21,34]. As this could contribute to higher variation when analysing all CHH subcontexts, CWA and non-CWA subcontexts were analysed separately. For this comparison a larger region size of 50 bp was used, given the lower density of CWA sites (4.5 times lower than that of CHH). CWA contexts had higher variation among reads in drd1, whereas in cmt2 these levels are comparable to all CHH subcontexts (Fig. 3e, S4a). Among-site variation remains similar. The increase in read methylation sd. can be explained by the higher methylation of CWA-methylated regions. In addition, CWA-methylated regions in drd1 have higher read methylation sd. relative to regions methylated at the same level in drd1, but still lower than in cmt2 (Fig. S4b). In non-CWA and all CHH-sites drd1 read methylation variation is similar to that of stochastic methylation. Methylated reads from CWA-methylated regions show a similar pattern in terms of methylation level and methylation processivity, as opposed to non-CWA methylated regions (Fig. S4c-d). This suggests that the CMT2 methylation pattern observed in CHH-methylated sites in drd1 is composed of two distinct patterns; however, even at sites for which CMT2 shows specificity, it has relatively lower processivity compared to DRM2 (Fig S4b, left panel).
Based on ANOVA of the methylation features in the CHH methylation mutants, we designed a classifier to score regions and whole functional elements in terms of CHH methylating enzyme. The results of the model are presented in Table 1. The separation of the mutants used to define the classifier based on enzyme score is presented in Figure 3f (regions) and Figure 3e (whole elements), along with the ROC curve of region prediction (Fig.  3h). Enzyme score ranges from 0 to 1, with lower values indicating a similarity to DRM2 methylation, and higher values indicating a similarity to CMT2 methylation. Each mutant has a single enzyme score peak, and these peaks are aligned for mutants affecting the same enzyme (Fig. 3g). The separation of the mutant samples in Figure 3g, and the ROC curve of the classifier demonstrate the potential of using the classifier to predict enzyme identity. A. thaliana cmt2 mutants from previous studies and drm2 -related mutants from multiple species show similar distributions of enzyme score, suggesting that this distribution is not specific to the mutant samples used to construct the classifier (Fig. S4b-c).
DRM2 and CMT2 methylation patterns are distinct from chromatin structure-dependent patterns DRM2 and CMT2 function at distinct chromatin environments; DRM2 via RdDM is targeted mainly to euchromatic TEs, whereas CMT2 via H3K9me2 is targeted preferentially to heterochromatic TEs [13,14,29]. Accordingly, it is possible that DRM2's and CMT2's distinct CHH methylation activities are influenced by the genomic chromatin environment rather by their intrinsic enzymatic activity. In order to test the role of genomic environment on DRM2 and CMT2 methylation patterns, we correlated enzyme score with GC content (a prominent indicator for chromatin structure [14,18,35]), for individual CHH-methylated regions in drd1 and cmt2 mutants (Fig. 4a). While regions methylated in drd1 showed on average higher GC content than regions methylated in cmt2, no correlation was found between GC content and enzyme score in either mutant (Fig. 4a). We also correlated enzyme score in cmt2, drd1 and drm2 with the following hetero-and eu-chromatic histone marks, H3K9me2 and H3K4me3, respectively [36]. As with GC content, the mutants are separated by both histone marks and enzyme score, but there is no correlation between histone marks and enzyme score within each mutant (Fig. 4b-c). These results suggest that processive and stochastic CHH methylation are intrinsic enzymatic activities of DRM2 and CMT2, respectively.

Plant and human DNMT3s show similar CHH methylation patterns to that of angiosperm CMT2
By identifying methylation patterns associated with either DRM2 or CMT2, the classifier can predict the presence or absence of the activity of either enzyme in samples from different tissues and species. Currently, in the absence of mutants showing partially reduced CHH methylation, it is unclear whether the total methylation pattern present in a given sample is derived from one or more CHH-methylating enzymes, and this distinction can not be made based on average methylation alone.
Mutants that possess only one active CHH methylating enzyme present a simplified case for the classifier, given that there is only one pattern. To assess the ability of the classifier to identify distinct patterns in more diverse samples, the classifier was applied to wild type samples of multiple species (Figure 5a, with the distribution of TEs in individual features presented in Figure 5b). Of the species analysed, A. thaliana, Oryza sativa and Solanum lycopersicum had two peaks, while Physcomitrella patens and Marchantia polymorpha had only one peak. In A. thaliana, the two peaks associate with each of the CHH methylation mutants in Figure 5a. S. lycopersicum and O. sativa have two peaks, similarly to A. thaliana, which are also aligned to the A. thaliana CHH methylation mutants; however, the ratio between the peaks is different. This is also apparent in Figure 5b: both features are defined in a similar range but with different mean values. This ratio relates to the frequency of TEs regulated by either CMT2 or DRM2. For example, rice is known for its exceptional number of MITEs ( 130k) targeted by RNA directed DNA methylation (RdDM) and DRMs [38].
While CG is the predominant methylation context in animals, non-CG methylation can be enhanced in particular tissues, such as the brain [3,40]. Non-CG methylation in mammals (also called CH methylation) is mediated by DNMT3s. In human, two DNMT3s, DNMT3a and DNMT3b, were found to mediate CH methylation [3]. Thus, to test how many CH methylation patterns exist in human data and their relationship to plant ones, we next ran our single-read method on human methylome derived brain tissues. Applied to human CH data using the same parameters as for CHH analyses, our classifier detected a single peak of enzyme score that overlaps that of plant DNMT3 and CMT2 enzymes (Fig. 5e). Distributions of variations of among site and read-methylation also show only a single peak of activity (Fig. 5f). These results suggest that CH methylation in neurons is mediated by a single dominant enzymatic pattern that is similar to that of plant DNMT3 and CMT2.
Conclusively, these results demonstrate the use of the enzyme classifier in predicting the presence or absence of CMT2-or DRM2-like methylating activity at a genomic scale.

Tissue-specific samples have different proportions of CMT2/DRM2methylated regions
The enzyme classifier relies on methylation features the range of which may be biased by sample composition. For example, read methylation may vary less within homogeneous samples, if methylation patterns are similar between cells. Given that read methylation variation is the strongest predictor of enzyme identity (r 2 = 0.589), the effectiveness of the classifier may be limited in such samples. In order to assess the ability of the classifier to function in tissue-specific samples, we used datasets from two studies that produced methylomes of sperm and vegetative nucleus cells [30], and root tissue subsamples [28]. Given that in each of these studies, altered regulation of CMT2/DRM2 activity was observed in one or more of the samples, this analysis also provided a means of validating the predictions of the classifier. Figure 6a-b demonstrates the differences in CHH methylation regulation among tissuespecific samples of A. thaliana: all samples contain two peaks; however, in some tissues enzyme activity shifts, with both sperm and root tip having more DRM2-methylated TEs. drm12 and cmt2 mutants from the same study [30] were also analysed, and both vegetative nucleus and sperm mutants are distinguishable based on enzyme score analysis (Fig. 6cd). As noted above, the wild type sperm sample has less CMT2 activity compared to the vegetative nucleus and other wild type A. thaliana samples analysed, confirming previous findings showing a reduced CHH methylation in heterochromatic TEs targeted by CMT2 [30].
Individual TEs and regions that are methylated in both sperm and vegetative nucleus are on average more similar to DRM2-methylated regions (Fig. S7a-b). In addition, relative to other DRM2-related mutants, the enzyme score peak of the sperm drm12 mutant is more DRM2-like (Fig. 6d). Interestingly, the sperm drm12 has higher read methylation variation, but lower among-site variation (Fig. 6b). This reflects the signal in the wild type sperm sample, in which overall among-site variation is low compared to other A. thaliana wild type tissues (Fig. 6b).
Compared to wild type sperm, drm12 sperm TEs are more regulated by CMT2 (Fig.  6f). This change is partly due to the loss of DRM2-methylated regions from TEs, rather than CMT2 methylating previously DRM2-methylated regions. However, the same change is observed also when comparing individual regions: regions retaining methylation in drm12 sperm, in particular regions that are defined by the classifier as regulated by DRM2 in the wild type sample, are shifted towards a CMT2-like signal (Fig. S7c, left panel). In the vegetative nucleus, the same shift is observed ( Fig. 6f; S7d, left panel). In the cmt2 mutant, in both sperm and vegetative nucleus, no change is observed at the level of individual regions (Fig. S7c-d, right panels).
We also analysed three root samples that differed in their enzyme score distributions for whole TEs: root tip (RT), columella root cap (CRC), and lower columella (LC) [28]. The root tip sample contains both CRC and LC, but is in itself different from other wild type tissues analysed (Fig. 6a). Similarly to sperm, it has a lower average among-site variation. In terms of enzyme score, all three samples have two peaks (Fig. S7e), but the distributions are not overlapping. A comparison of enzyme score of individual TEs between samples shows that in the LC sample, all TEs are shifted towards a DRM2-like signal (Fig.  S7f); TEs with intermediate signals (e.g. regulated by both enzymes) are more shifted than those with more defined DRM2/CMT2 signals (e.g. regulated by one enzyme). The same shift is present also in the CRC sample relative to RT (Fig. S7g). Overall, these results (e-f) Correlation between enzyme score of individual TEs in wild type compared to mutant tissues. Only elements that were methylated in both wild type and respective mutant were selected. Each dot represents a single TE, with the distance from dashed line (f (x) = x) indicating a shift in enzyme score between samples (i.e. change in the average CHH enzyme methylating the element). (g) Schematic representation of the methylation patterns of CMT2 and DRM2 identified in this study: CMT2-mediated methylation has lower processivity and has a unimodal distribution among cells; DRM2mediated methylation has higher processivity and has a bimodal distribution among cells.
demonstrate the ability of the classifier to predict changes in the activity level of DRM2 and CMT2 in different tissues.
SRBrowse, a tool for visualising and analysing BS-seq data at singleread resolution Most popular genome browsers, including UCSC genome browser and Integrative Genomics Viewer (IGV) [41], are limited in their ability to load high resolution data on-the-fly without creating a large memory footprint, significantly increasing the load times of browser displays or requiring pre-processing of data. In order to visualise single read data, it is necessary to convert aligned read data (e.g. SAM/BAM files) into track files (such as GFF) or compressed indexed files (BED, TDF, etc.) suitable for fast retrieval.
In order to make browsing and analysing BS-seq data at single-read resolution more accessible, we developed a genome browser specifically designed for visualising BS-seq data at a single-read level, which we called SRBrowse. SRBrowse is web browser-based and can run on a local computer or server with minimal software requirements (see the repository README file on installation). Importantly, SRBrowse allows users to load data into the browser view and monitor its alignment progress from the same interface. Typical steps for loading and displaying data appear in Figure S8. All data loaded through SRBrowse is aligned using bowtie2 and stored in indexed files allowing fast access to the read data.

Discussion
We presented a novel analysis pipeline for extracting additional layers of information from NGS BS-seq data. The pipeline uses data of individual read methylation states and the distribution of DNA methylation within reads to identify patterns that augment information regarding the averaged methylation signal. Using this pipeline, we were able to define characteristic features of reads and regions methylated by two non-CG methylating enzymes CMT2 and DRM2 in A. thaliana. Specifically, we found that A. thaliana mutants of CMT2 and DRM2 present stereotypical CHH methylation patterns that are robust to background methylation and consistent among different mutated alleles and species (Fig. 5, S6). These patterns are also independent of demethylase-activity: in the absence of demethylase activity, the same distinct patterns are observed in regions regulated by each enzyme (Fig. 2c-d,  S3). On the one hand, cmt2 mutants have mainly highly methylated reads, and methylation is concentrated within specific regions; on the other hand, drm2 and drd1 mutants have mainly lowly methylated reads, and methylation is distributed stochastically within and among reads. In other words, our analysis suggests that, compared to DRM2, CMT2 has lower processivity and presents more stochastic variation of methylation level among cells. In contrast, DRM2-methylated regions present distinct subpopulations of methylation states, with less stochastic variation (Fig. 6g).
Variation in DRM2 and CMT2 methylation characteristics could relate to the distinct genomic targets of these enzymes. DRM2 methylates mostly short euchromatic-TE sequences located next to genes and CMT2 methylates mainly long heterochromatic-TE sequences [13,14]. Thus, DRM2 subpopulations of highly processive methylation and unmethylation states could relate to the ability of DRM2 methylation to regulate genes within particular cell types or under certain conditions (Fig. 6g), for example in the formation of lateral root development [42]. In contrast, CMT2 low but uniform methylation correlates with constant need to silence heterochromatic TEs (Fig. 6g). Furthermore, given that we found no correlation between chromatin structure markers and CMT2 and DRM2 single read methylation patterns (Fig. 4), this suggests that the biological significance of CMT2 and DRM2 distinct methylation patterns are determined by their intrinsic enzymatic characteristics rather than by the chromatin environment.
Based on the variation of these patterns between CHH-methylating mutants, we designed a classifier that scores short regions of 30 base pairs and collections of regions within functional elements (such as genes, exons or transposable elements). This score provides an arbitrary scale to differentiate between DRM2-like and CMT2-like CHH methylation patterns. The comparison among species highlights the ability of the classifier to predict the presence or absence of CMT2/DRM2 in species for which mutants have not yet been developed, such as M. polymorpha. The classifier is robust to differences in sample heterogeneity, and is able to differentiate between methylation patterns even within highly specific tissue samples (Fig. 6).
While DRM2 and plants as well as human DNMT3 are monophyletic and distantly related to CMT2, DRM2 is the only enzyme that has a rearranged catalytic domain [16]. Therefore, our findings that DNMT3 and CMT2 have similar CHH methylation characteristics suggest that different DNMTs can have similar methylation mechanisms, and substantiates the hypothesis that DNMT3 CHH methylation activity in early land plants has been replaced by CMT2 in angiosperm [16]. Moreover, the DRM2 unique high-processivity methylation activity can be associated to its exclusive rearranged catalytic domain rather than to its general homology to DNMT3 enzymes.

Conclusion
Overall, the analyses of methylation profiles we present here demonstrate the potential of studying patterns of variation in BS-seq data, through single-read analysis, which provide new biological insights on the writing, erasing, and readout mechanisms of CHH methylation. The tool we developed can facilitate further studies of methylomes at single-read resolution.

BS-seq alignment
All code used for the read analysis pipeline will be deposited in a publicly available code repository (https://github.com/zemachlab/srbrowse). For aligning reads from BS-seq data we used bowtie2 with a Node.js-based wrapper. The method we used for aligning BS-seq data is based on a previously described pipeline [14]. The wrapper converts the reference assembly to C-to-T and G-to-A sequences before bowtie2 indexing; each strand is converted manually so that each index consists both of forward and reverse strand versions of each scaffold. BS-seq reads are converted either C-to-T or G-to-A depending on whether the read is a left or right mate (in the case of paired ends reads), with the original read data stored for collecting methylation information after alignment. Bowtie2 was run with end-to-end search algorithm. For all datasets a minimum score of 0 was used (i.e. no mismatches or gaps). Reads that mapped to more than one position were discarded. Aligned reads were then sorted and exact duplicates removed.

Analyses of BS-seq data
Analyses consist of three stages: (1) identifying short regions (10-100bp) according to the region selection parameters (see 2.1); (2) extracting reads overlapping with each region from the selected samples; and (3) averaging read data. Region selection parameters were optimised to ensure sufficient data for low coverage samples (Fig. S1a-c). Functional elements are first selected according to an annotation provided in gff/gff3 format. Next, sites of specific methylation contexts are identified based on the reference sequence of the element (e.g. CHH sites), per strand. Separation to strands is important for asymmetrical contexts such as CHH. Regions are defined by iterating through sites until N sites within the defined region size are found. Reads from aligned BS-seq samples are retrieved from indexed lists of reads, and stored as binary arrays where each site is represented as either unmethylated (0) or methylated (1).
The read methylation data of a specific region were analysed to produce methylation features (Fig. 1b) of individual reads and their associated regions. For individual reads these features are (1) read methylation, the mean of the vector components, and (2) read stochasticity, the number of changes between methylation states between adjacent sites (for illustration, see Figure 1b). Importantly, features of individual reads do not refer to the overall methylation of the read, but only to methylation at the sites included in the specific region. For regions, the methylation features are (1) mean read methylation; (2) standard deviation of read methylation; (3) mean read stochasticity; and (4) standard deviation of site methylation. The last feature is not based on single-read data but rather the averaged methylation signal at each site.
The output data of this analysis can be either at the level of individual reads, individual regions, or whole functional elements. For reads data, the output is an array of reads for each sample from all regions matching the selection parameters. For regions, the output is an array of regions with features derived from averaged read vectors as explained above. The regions also have positional data relating to their parent element, length, and any other genomic features of interest (GC content, etc.). For whole functional elements, the output is an array of elements such as exons or transposable elements, where methylation features relate to the average of all regions identified with the functional element. The exclusion of regions based on methylation or coverage prior to averaging whole elements is important to reduce background of unmethylated regions. Unless stated otherwise, regions were selected with a minimum of 10% methylation average and 4 overlapping reads. Functional elements that contained at least two such regions were selected. While increasing the minimum regions per element improves coverage per element, it can bias the analysis to longer elements.

Statistical analyses
All statistical analyses we performed on either read, region or element data resulting from the above pipeline, using Python 3.6 along with the following libraries: matplotlib, numpy, scipy, statsmodels, pandas and seaborn. K-density plots were produced using seaborn.distplot (which implements statsmodels.nonparametric.kde.KDEUnivariate) with gaussian kernel shape and Scott's Rule of Thumb bandwidth. For ANOVA of methylation features we used methylated regions from the CHH methylation mutants drd1 and cmt2 (composed of data from cmt2-4, cmt2-5 and cmt2-6 mutant alleles). Methylation features of the regions were provided as independent variables, and sample source (0 for drd1, 1 for cmt2 ) as the dependent variable. The results of the ordinary least squares model are presented in Table  S1. For the classifier, the resulting coefficients of the independent variables were scaled