Data and global patterns
We obtained whole-methylome bisulfite sequencing data at single-base resolution from peripheral blood mononuclear cells (PBMCs) of three individuals in a family trio: Father (F), Mother (M) and Daughter (D) from our previous study (Lee HM et al., Discovery of type 2 diabetes genes using a multiomic analysis in a family trio, submitted). Correspondingly, we extracted total RNA from the three samples and performed whole-transcriptome shotgun sequencing. After data preprocessing, about 95% of the resulting reads were uniquely mapped to the human reference genome (Additional file 1: Table S1).
High correlations between methylation patterns in the three genomes
We first explored the global patterns of DNA methylation in the three individuals. Overall, both the absolute number of methylated cytosines within CpG dinucleotides (mCG) in 10 kb sliding windows and the density of methylated cytosines with respect to the total number of CpG dinucleotides within the window (mCG/CG) are highly correlated among the three individuals (Additional file 1: Figure S1 for the whole genome, Figure 1 for chromosome 1 as an example). The methylation density measure mCG/CG has been commonly used in various methylome studies to quantify DNA methylation level [4],[22],[28]. To check if our data were able to capture subtle DNA methylation differences among the three individuals, we computed the correlation of every 15 adjacent windows between each of the three pairs of individuals. To filter out local fluctuations due to intrinsic randomness in sequencing experiments, we progressively increased the window size from 10 kb to 250 kb. When the window size was 10 kb, both mCG and mCG/CG identified a lot of regions with poor correlation between two individuals (Additional file 1: Figures S2–S7), signifying regions with potential differential methylation status. When the window size was increased, the number of poorly correlated regions decreased for both methylation measures, but the decrease was more rapid for mCG, indicating that mCG/CG is more sensitive to small fluctuations, in particular in windows that contain a small number of CpG dinucleotides.
We collected the low-correlation regions that consistently showed up on the lists at different window sizes, and used DAVID [41] to test for any functional enrichment of the genes inside these regions. At a significance level of p = 0.05 after correcting for multiple hypothesis testing using the Benjamini-Hochberg procedure, some terms were significantly enriched in these genes, including O-methyltransferase (p = 0.0057), melatonin metabolic process (p = 0.023) and hormone biosynthetic process (p = 0.047) (Additional file 1: Table S2). Notably, melatonin secretion was known to be associated with type 2 diabetes (T2D) [42]. As two of the three samples in our study were obtained from individuals with T2D (Lee HM et al., Discovery of type 2 diabetes genes using a multiomic analysis in a family trio, submitted), our results indicated that our data were able to capture relevant information related to the physiological status of the samples.
L-shaped patterns between methylated CpG count and gene expression
We then examined the relationship between DNA methylation and expression levels of genes in the three individuals. We computed the average methylation level along each gene, considering both the gene body and upstream regions, and plotted these methylation levels against the corresponding expression levels. The resulting scatterplot based on the mCG quantification measure of DNA methylation (Figure 2A) displays a very clear “L” shape, in which genes with very high expression levels all display very low methylation levels, and genes with very high methylation levels all show very low expression levels. This pattern suggests that for these extreme cases, there is a negative correlation between DNA methylation and gene expression. However, the majority of genes have both low methylation and expression levels, and the global correlations between DNA methylation and gene expression when all genes are considered were not strong (Figure 2A, Pearson correlation =−0.0486, Spearman correlation = 0.0709), despite significant p-value of the Pearson correlation due to the large number of genesinvolved.
In contrast, the plot based on the normalized measure, mCG/CG, does not display an L-shaped pattern, but rather shows a more global negative correlation with gene expression (Figure 2B, Pearson correlation =−0.1293, Spearman correlation =−0.3705). When the methylation levels were plotted against log expression values instead, the L-shaped patterns became less clear (Additional file 1: Figure S12a,b), but DNA methylation and gene expression were still observed to be weakly anti-correlated.
To get a better understanding on the differences that exist between different quantification measures for DNA methylation, we also normalized mCG by the total length of the measured region (gene body and upstream regions in this case), or by both the number of CpG sites and the region length. We denote these measures as mCG/len and mCG/CG/len, respectively. The two corresponding scatterplots both exhibit some weaker L-shaped patterns (Figure 2C and D).
These observed differences led us to check whether we could find positive correlations between gene body methylation and expression levels as reported in some previous studies [28],[29]. To do that, instead of considering both upstream regions and gene bodies at the same time as in Figure 2, we made separate scatterplots for upstream regions (Additional file 1: Figure S8) and gene bodies (Additional file 1: Figure S9). We found a weak positive correlation between gene body methylation and gene expression for the mCG/len measure based on Spearman correlation (Additional file 1: Figure S9c). However, for the other quantification measures, no significant global correlations were observed. For mCG, L-shaped patterns were observed for both upstream regions and gene bodies (Additional file 1: Figures S8a and S9a). We also checked exons (Additional file 1: Figure S10) and introns (Additional file 1: Figure S11) separately, and found no significant differences between the global patterns of these plots and those in which they were taken together as gene bodies. Again, plots based on log expression levels exhibited similar correlation values but less apparent visual patterns (Additional file 1: Figures S12–S16).
These initial results indicate that the relationship between DNA methylation and gene expression is complex and non-linear. The expression levels of genes with very strong methylation levels appear much more affected by DNA methylation than other genes. Whether DNA methylation at promoters and gene bodies have opposite quantitative relationships with gene expression also warrants further investigation.
Quantitative modeling
To systematically study the quantitative relationships between DNA methylation and gene expression, we performed statistical modeling by means of machine learning. The idea is to compute DNA methylation levels at different sub-regions of a gene as its features, and construct a model that can tell the expression level of any given gene based on its features. The accuracy of a model can be quantified by comparing the model outputs and the actual expression levels of the genes measured by RNA-seq. We constructed different models using different sub-regions and DNA methylation measures, to test which ones could better explain the observed expression levels.
Specifically, for each annotated gene, we computed methylation levels in 16 different sub-regions around its gene body and flanking regions (Figure 3). Within the gene body, we defined 6 sub-regions as in a previous study [22], namely first exon (FirstEx), first intron (FirstInt), internal exons (IntnEx), internal introns (IntnInt), last exon (LastEx), and last intron (LastInt). For the upstream and downstream regions, we defined 5 non-overlapping 400 bp sub-regions each (Up1-Up5 and Dw1-Dw5, respectively). We divided all genes into four equal-sized classes according to their expression levels, namely Highest, Medium-high, Medium-low and Lowest, which correspond to genes with expression within the first, second, third and fourth quartiles, respectively. In the first set of models, we combined the data from the three individuals to maximize the amount of data for model construction, resulting in 53,535 (17,845 × 3) data records from 17,845 genes. We tested our models using a left-out procedure, in which two-thirds of the genes from all three individuals were used in model training, and the accuracy of a model was evaluated using the remaining one-third of the genes. We then repeated the procedure 5 times using different random training and testing sets and reported their average accuracy, to ensure the reliability of the results.
DNA methylation is partially indicative of expression class
We first constructed models with all DNA methylation features from the 16 sub-regions of each gene, using the mCG methylation measure. We tried 11 different model construction methods, and found that the Random Forest method [43] produced models with the highest cross-validation accuracy, regardless of the exact way model accuracy was computed (Additional file 1: Figure S17). We thus used the modeling accuracy of this method as a proxy of how indicative of gene expression the methylation features are. Based on the AUC measure (area under the receiver operator characteristic curve), the accuracy of the one-class-against-all models for the four expression classes ranged from 0.63 to 0.82 (Additional file 1: Figure S18), where a random assignment of genes to expression classes would result in an AUC value of 0.5, indicating that the methylation features were able to partially separate genes from different expression classes. Among the four expression classes, the Lowest expression class had the highest accuracy, followed by the Highest, Medium-high and Medium-low classes. These results are consistent with what we observed from the scatterplots, that many genes with the lowest expression levels have very high methylation patterns, which can separate them from genes with higher expression levels. The genes with the highest expression levels are slightly more difficult to identify since their signature of low methylation is also shared by many genes from other expression classes. Lacking clear signatures from DNA methylation levels alone, genes in the two medium expression classes are most difficult to identify. The same trends were observed when we repeated the analysis with all four DNA methylation quantification measures and a wide range of expression class numbers (from 2 to 64, Additional file 1: Figures S19–S22).
Gene body methylation is a stronger indicator of expression class than promoter methylation
We then compared the models constructed using features from either the upstream regions, gene bodies or downstream regions alone (Figure 4). Methylation levels at gene bodies were more capable of telling the expression class of a gene than upstream and downstream regions, for all four expression classes. Combining features from all sub-regions gave the best modeling accuracy, which shows that the features from the different sub-regions are not totally redundant, and may play different roles in gene regulation. These observations stay true for all four methylation quantification measures (Figure 5 and Additional file 1: Figure S23). Comparing the modeling accuracy of the four methylation measures, none of them is clearly better than the others, although on average mCG/CG/len had a slightly higher accuracy.
A potential confounding factor of the above analyses is that the upstream and downstream regions of a transcript could overlap with the body of another transcript [32]. For instance, for a multi-exon gene, if it has a transcript that does not involve the first exon of the gene, DNA methylation at the promoter of the transcript would be counted as gene body methylation of the gene, which may confuse the statistical models. To study how much this would affect the results, we repeated the statistical modeling using the subset of genes with only one annotated transcript isoform. Comparing the resulting models based on different feature sets (Additional file 1: Figure S24), gene bodies still showed stronger modeling power than upstream and downstream regions, and the best accuracy was still obtained by combining features from all three sub-regions.
It was previously shown that DNA methylation of the first exon is linked to transcriptional silencing [44]. We checked whether the higher modeling accuracy of gene body feature was merely due to some strong features extended from the promoter to the first exon. Specifically, we considered two more sub-regions, namely gene bodies excluding the first exons (Genebody–FirstEx) and upstream regions including the first exons (Upstream+FirstEx). We observed that including the first exon in the upstream regions (Upstream+FirstEx) or gene bodies (Genebody) indeed increased the modeling accuracy as compared to having it excluded (Upstream and Genebody–FirstEx, respectively), thus confirming the important role of this sub-region in signifying expression levels (Additional file 1: Figure S25). On the other hand, when we compared upstream and gene body regions, we found that the modeling accuracy of Genebody–FirstEx was higher than Upstream, and that of Genebody was higher than Upstream+FirstEx when all annotated genes were considered (Additional file 1: Figure S25). The same trends were also observed when only genes with one annotated transcript isoform were considered (Additional file 1: Figure S24), except for a slightly higher accuracy of Upstream than Genebody–FirstEx when the mCG/len methylation measure was used. Altogether, our results show that in general, DNA methylation at gene bodies is a stronger indicator of the expression class than DNA methylation at promoters, and it is neither due to overlapping definitions of promoters and gene bodies for multi-transcript genes, nor signals coming from the first exon only.
We also compared the modeling accuracy of exons and introns. For all four quantification measures, methylation levels at exons were consistently a better indicator of expression than methylation levels at introns (Additional file 1: Figure S25), but again the modeling accuracy was higher when both types of features were considered than when either one was used alone.
To test if the above observations are sensitive to the way we define expression classes, we also used a second way to divide genes into four expression classes covering equal range of log-expression values. The results (Additional file 1: Figure S26) show that all the main observations discussed above remain unaffected.
Quantitative relationships between promoter and gene body methylation
Since both promoter and gene body methylation are indicative of gene expression to a certain extent, we next explored whether they carry redundant information. When plotting the DNA methylation levels at these two regions for all genes, the distributions based on the four quantification measures were found to be very different (Additional file 1: Figure S27). An L-shaped pattern was observed for mCG (Additional file 1: Figure S27a) and less obviously for mCG/len (Additional file 1: Figure S27c), but not for the other two measures (Additional file 1: Figure S27b and d). Notably, when mCG/CG was used for quantification, the genes were divided into two large clusters (Additional file 1: Figure S27b). Both clusters display very high level of gene body methylation, but one with very high and the other with very low promoter methylation. We also created scatterplots for studying the relationships between the length, the number of CpGs, and the number of methylated CpGs in each sub-region, for each of the 16 types of sub-regions (Additional file 1: Figures S28–S30). The scatterplots between number of CpGs and number of methylated CpGs reveal some interesting patterns about the two clusters in the mCG/CG plot (Additional file 1: Figure S29). For most gene body sub-regions except FirstEx and to some degree LastEx, the genes form a straight line along the diagonal line CG = mCG, showing that the different genes actually have different absolute number of CpGs at their gene bodies, but most of their internal exons and internal introns are fully methylated. In contrast, for the upstream and downstream sub-regions, as well as the first exon, the genes form a tilted V-shaped pattern, with a group of genes lying close to the diagonal CG = mCG and another group lying close to the vertical axis mCG = 0, which correspond to the extreme cases with fully methylated and fully unmethylated CpGs.
To gain more insights into the relationships between promoter and gene body methylation, we included in our analysis the expression levels of the genes (Additional file 1: Figure S31). The three-dimensional scatterplot based on the mCG measure displays the sharpest pattern among the four plots (Additional file 1: Figure S31a), which shows a “triple-inverse” relationship between promoter methylation, gene body methylation and gene expression. This triple-inverse relationship indicates that a gene can either have a high promoter mCG level, a high gene body mCG level, or a high expression level, but not two or three of them simultaneously. This relationship between the three quantities is consistent with the L-shaped patterns we previously observed in the 2D plots (Additional file 1: Figures S8a, S9a and S27a). These results suggest that in terms of the absolute number of methylated CpG sites, either strong promoter methylation or strong gene body methylation alone is sufficient to indicate low expression, and it is not required for a gene to redundantly have both indicators.
Potential role of gene body methylation for genes with CpG-poor promoters
It has been proposed that for CpG island promoters, DNA methylation is a sufficient but not necessary condition for gene inactivation, while for CpG-poor promoters, DNA methylation does not preclude expression [19]. To check whether the same observations could be made in our data, we plotted the expression level of different groups of genes according to their promoter CpG levels (Figure 6A and B). Indeed, the expression levels of genes with a large number of CpG dinucleotides in their promoter regions were more strongly affected by the DNA methylation in these regions. Specifically, for both mCG and mCG/CG measures, promoter methylation was more anti-correlated with gene expression for genes with highest or medium promoter CpG levels (first two bar sets of the figures) than those with lowest promoter CpG levels (last bar sets of the figures). Genes with lowest promoter CpG levels were largely insensitive to promoter methylation, and had low expression levels in general.
For this group of genes with CpG-poor promoters, can gene body methylation indicate their expression levels? To answer this question, we again divided genes into three groups according to their promoter CpG counts, but this time we studied the correlation between gene body methylation and expression levels of each group instead (Figure 6C and D). For both mCG and mCG/CG, the genes with CpG-poor promoters do exhibit some weak differential expression patterns as gene body methylation level varies, but the correlation between gene body methylation and expression was positive for mCG and negative for mCG/CG. These results suggest a potential role of gene body methylation in regulating genes with CpG-poor promoters, although the exact mode of regulation is yet to be understood.
Generality of the quantitative models
All the results above were based on quantitative models both constructed and tested on the same individuals (albeit on different subsets of genes), using data from one single cell type (PBMC). To test if these models are generally useful for signifying expression classes, we collected single-base resolution bisulfite sequencing and RNA-seq data for two cell lines, H1 human embryonic stem cells (hESC) and the human lung fibroblast line IMR90, from the Roadmap Epigenomics Project [45] (Additional file 1: Table S3). We constructed models using DNA methylation and expression data from one individual/cell line, and applied the models to predict the expression class of genes in another individual/cell line based on its DNA methylation profile alone. To ensure the generality of the models, the genes used for training in the first individual/cell line and the genes used for testing in the second individual/cell line were mutually exclusive.
The results (Figure 7) show that, for all combinations of training and testing individuals/cell lines, the prediction accuracy was much higher than random predictions (which would have an AUC value of 0.5). Models constructed from any one of the three individuals were able to predict the expression classes of genes in another individual with an average AUC of about 0.9, which is expected as these samples all contained PBMC from individuals in the same family. More interestingly, the other data set combinations also have prediction accuracy of about 0.75 on average, which demonstrate the generality of the constructed models. These cross-sample results reconfirm our earlier findings that the more extreme expression classes are better indicated by methylation patterns. Moreover, among the four methylation quantification measures used, mCG, mCG/len and mCG/CG/len consistently provided better modeling accuracy than mCG/CG (Figure 7), which indicates that the commonly-used quantification measure of DNA methylation, mCG/CG, is not necessarily the best in signifying gene expression classes.
Quantitative relationship with histone modifications
Our quantitative models based on DNA methylation were able to achieve reasonable accuracy in identifying the expression class of a gene, but they also show that DNA methylation alone is not informative enough to signify precise expression levels. We have previously shown that histone modifications are strong indicators of expression levels [46],[47]. Therefore, we next explored the relationship between DNA methylation and histone modifications in terms of indicating gene expression, and tested whether information on gene expression conveyed by DNA methylation is totally subsumed by that of histone modifications. It was previously shown that promoter methylation was negatively correlated with H3K4me3 (histone 3 lysine 4 trimethylation) in the human brain [32], and gene body methylation was positively correlated with H3K36me3 and negatively correlated with H3K27me3 in a B-lymphocyte cell line [28]. To study the quantitative relationships between DNA methylation and histone modifications in the context of indicating expression levels, we compared statistical models that involve either only DNA methylation features, only histone modification features, or both.
We collected ChIP-seq data for 26 types of histone modification from the H1 embryonic cell line from the Roadmap Epigenomics Project (Additional file 1: Table S3). As with DNA methylation, we computed the average signal of each type of histone modification in the same 16 sub-regions for each gene. Although some histone marks are known to be enriched in particular sub-regions, this knowledge is limited to some well-studied types of histone modifications. We therefore considered all sub-regions and let the Random Forest method identify the features most useful for indicating expression levels.
As expected, some of the models constructed from histone modification features alone had high cross-validation accuracy (Figure 8). Consistent with previous findings, the two strongest feature sets were H3K36me3 and H3K4me3, which mark actively transcribed regions and active promoters, respectively [48]. Models based on DNA methylation features alone were not as accurate as those constructed from these histone modification features well-known for their roles in marking gene activities, but were more accurate than many other types of histone modification such as H3K9me3 and H3K4me1 (Figure 8).
DNA methylation and histone modifications contain non-redundant information about gene expression
Interestingly, regardless of the type of histone modification and the DNA methylation measure used, combining both types of features consistently increased the accuracy of the corresponding models involving only histone modification features or only DNA methylation features. Even for the strongest histone modification feature set derived from H3K36me3, incorporating DNA methylation features still led to an improvement of modeling accuracy by about 6%, from AUC value of 0.83 to 0.88 for mCG/CG/len, which indicates that the two types of signals were not completely redundant in terms of signifying gene expression.
To better understand how DNA methylation complements histone modification in indicating expression classes, we examined the DNA methylation and H3K36me3 signal levels of two types of genes, namely (1) those with expression classes correctly identified by the model involving only mCG/CG/len features but not by the model involving only H3K36me3 features, and (2) the vice versa, i.e., those with expression classes correctly identified by the H3K36me3 model but not the mCG/CG/len model. The genes with expression classes correctly identified by the mCG/CG/len model only displayed higher mCG/CG/len levels (Figure 9A, blue lines and areas) and lower H3K36me3 levels (Figure 9B), indicating that in general they were the less transcribed genes. Among the different sub-regions, as expected the ones best separating the two groups of genes in terms of H3K36me3 signals were those within the gene bodies, and to a lesser extent those at downstream regions (Figure 9B). Interestingly, in terms of mCG/CG/len levels, the sub-regions that best separate the two groups of genes were the exonic regions, especially the first exon (Figure 9A), indicating that methylation levels at exonic regions not only play crucial roles in models involving DNA methylation features alone, but could also be important in complementing histone modifications in indicating the expression class of a gene.
As in the case of DNA methylation, histone modification features were most successful in identifying genes with lowest expression levels (Additional file 1: Figure S32). However, even the strongest histone modification features were not significantly better than DNA methylation in identifying these genes. In contrast, some of them were much better in identifying genes with medium expression levels, suggesting that DNA methylation mainly indicates the coarse on/off status of a gene, while some histone marks provide more fine-grained details about the precise expression levels.
We examined the relationships between DNA methylation and histone modifications in more detail by plotting their values in different sub-regions of genes (Additional file 1: Figures S33–S34). In particular, we reconfirmed previous findings that DNA methylation and H3K4me3 negatively correlate at the upstream region (Figure 10). However, whether gene body methylation positively or negatively correlates with H3K36me3 depends on the DNA quantification measure (Figure 11), with the correlation being most positive for mCG/len, and most negative for mCG/CG.
A small number of DNA methylation and histone modification features are sufficient to maximally indicate gene expression
When we combined features derived from DNA methylation and all 26 types of histone modifications, the resulting model had a higher accuracy than all the models involving single histone modification and/or DNA methylation features (Figure 8). To test if it is possible to achieve the same accuracy with a smaller number of feature sets, we applied a forward feature selection procedure. Specifically, we started with either an empty set of features, or all DNA methylation features based on one quantification measure. We then iteratively added the set of features for the type of histone modification that could maximize the accuracy gain, until no more sets could lead to any further improvements. Depending on the DNA methylation features included in the first step, maximal accuracy was achieved by 6-8 feature sets in total (Additional file 1: Figure S45).
Consistent with the single-feature-set results, H3K36me3 and H3K4me3 were always the features first incorporated into the models. The features next incorporated include those that involve H3K79, and the repressive mark H3K27me3. For the DNA methylation measures mCG, mCG/CG and mCG/CG/len, including DNA methylation features resulted in final models with higher accuracy than the one involving histone modification features alone, indicating that DNA methylation has non-negligible roles in these models with maximal modeling accuracy.
Since the AUC values were increased most by H3K36me3 and H3K4me3, and these two marks are well-known to be most indicative of expression levels, we believe similar results would be obtained if we had applied other feature selection methods.