Previous studies have examined high-level qualitative relationships between DNA methylation and gene expression. In this work, we have demonstrated that DNA methylation status alone can indicate the expression class of a gene with fairly high accuracy. The generality of our models has been confirmed by their cross-sample/cell line modeling capability. Our models provide a means to analyze the detailed quantitative relationships between DNA methylation and expression, with systematic assessments of the level of expression variations explainable by DNA methylation.
We showed that two groups of genes have particularly clear methylation profiles in our data, namely genes that lie on both ends of the spectrum – those with very high methylation and very low expression levels, and those with very high expression and very low methylation levels. If we apply a simple classification of genes into those with high or low expression and DNA methylation levels, among the four possible combinations, the one with both high expression and high DNA methylation is almost devoid of genes when three out of the four DNA methylation quantification measures were used. The resulting scatterplots exhibit clear L-shaped patterns (Figure 2), which correspond to an exclusive OR (XOR) relationship between DNA methylation and gene expression. Our results indicate that on the one hand, strong DNA methylation is sufficient to indicate low expression of a gene, but on the other hand, while low DNA methylation is permissive of transcription, the actual expression level of a gene is largely determined by other factors.
We further demonstrated that one class of such factors is histone modification. Some types of histone modification, including H3K4me3 and H3K36me3, are much stronger indicators of precise expression levels of individual genes than DNA methylation. However, we found that incorporating DNA methylation features consistently improved the modeling power of the models involving either of these histone marks alone, or even the one involving all types of histone modification combined (Figure 8). Notably, we found that DNA methylation levels at exonic regions helped determine the expression class of some genes in our models when H3K36me3 features failed to do so.
A key finding of this study is that gene body methylation is a stronger indicator of expression class than promoter methylation for genes in all expression classes. Our results are consistent with the strong effects of gene body methylation on expression previously observed in plants [49,50]. We provided evidence that the stronger modeling power of gene body methylation could not be explained by the effects of first exons alone or biases caused by the presence of multiple transcript isoforms in a single gene, nor was it affected by the quantification measure of DNA methylation levels. We also found that combining both promoter and gene body DNA methylation features resulted in a better modeling accuracy of gene expression classes. The “triple-inverse” pattern observed between promoter methylation, gene body methylation and gene expression (Additional file 1: Figure S31a) suggests that promoter and gene body methylation exert repressive effects on different sets of genes. Previous studies have proposed that promoter methylation is linked to blockage of transcription factors, while gene body methylation is related to the recruitment of transcriptional repressors and reduction of transcriptional elongation [1,9,31]. The potentially divergent roles of DNA methylation at the two types of regions are consistent with the higher modeling accuracy achieved in our study when both types of features were considered. Since the on/off role of promoter methylation appears to affect a relatively small set of genes with extreme methylation levels, we speculate that the effect of gene body methylation on reducing transcription efficiency may be a more general mechanism that affects a broader group of genes, which provides a plausible explanation for the stronger modeling power of the gene body methylation features in our current study.
We propose that the different functions of DNA methylation in transcriptional regulation are better reflected by multiple quantification measures rather than one single measure. It is possible that the raw number of methylated CpG sites, mCG, is a proxy of the total time of an elongating polymerase being slowed down by gene body methylation. Another quantification measure, the number of methylated CpG sites per unit length, mCG/len, may be more related to the average speed reduction of the elongating polymerase. Finally, the commonly-used density measure mCG/CG represents a comparison between methylated and unmethylated CpG sites in a given genomic region, which may reflect the “competitiveness” of the region for events such as protein binding. In this study, we demonstrated that these quantification measures used to represent methylation levels at a given genomic region could exhibit drastically different patterns when analyzed together with gene expression and histone modification signals. However, all of them were able to model expression classes reasonably well and none of them was clearly better than all the others. Further investigations are needed to study the detailed mechanistic meanings of these different quantification measures.
Our results offer several possible explanations for the apparent discrepancies among previous studies examining the relationships between gene body methylation and gene expression, that in some studies they were observed to be positively correlated [31,32] and in others, negatively correlated [28,29,51,52]. We found that substantially different correlation values could be obtained by using different quantification measures of DNA methylation, and different ways to compute the correlations. For example, whereas rank-based Spearman correlation is more strongly affected by the bulk of genes with low methylation and expression levels as they occupy a wide range of rank values, value-based Pearson correlation is more influenced by genes with more extreme methylation and expression levels. Calculating correlations using different subsets of genes, such as all genes versus only those with observable expression values, could also lead to very different conclusions. The discrepancies in the previous studies could be due to these and other subtle data processing and analysis details.
Further studies will be needed to elucidate how promoter and gene body methylation of different transcripts of a gene are coordinated. Signals that cover a broad area, such as DNA methylation over whole transcript bodies, have a high chance of interfering with other transcripts. The coordination would be simplest if promoter and gene body methylation both take on a repressive role, and different transcript isoforms of a gene co-express in a synchronized manner. In that case, DNA methylation would be mainly responsible for marking genes with all transcripts repressed. The co-expression of transcript isoforms was indeed observed in large-scale sequencing data from human cells , although it is still unclear whether the different isoforms expressed simultaneously in the same cell, or actually different subsets of them were expressed in different sub-populations of the cells from which RNA was extracted and sequenced. Alternatively, intragenic DNA methylation that intersects promoters of some transcripts may be involved in regulating the use of alternative promoters . Whether other, more complex types of coordination exist is yet to be studied.
Our study of the quantification measures at different genic sub-regions was facilitated by whole-genome bisulfite sequencing data at single-base resolution. Some other experimental methods produce data at a lower resolution (such as ChIP-based or affinity-capture-based methods), have incomplete genome coverage, especially at transcribed regions (such as some array-based methods), or provide information for only some types of DNA methylation quantification. Nevertheless, whole-genome bisulfite sequencing has a relatively high cost, and it requires extensive computations in data processing. For practical purposes, it would be crucial to choose a suitable experimental method based on the goal of the study. For example, methylation profiles are obtained from case and control samples in some disease studies, to identify differentially methylated regions with functional consequences. Our results offer new insights into choosing the suitable experimental method by indicating that for the vast majority of genes with moderate or low methylation levels, their expression levels are only weakly reflected by methylation levels, but more strongly affected by other factors. Therefore, if one is to make hypotheses based on the methylation data alone, it is more reasonable to consider only genes with extreme methylation levels. These extreme cases can probably be identified using low-resolution data with partial genome coverage. In contrast, if one wants to identify all differentially methylated genes for downstream experimental testing of their functional effects, data with higher resolution can probably provide more details about subtle differences that exist among the various samples. Additionally, it has recently been proposed that methylation at distal enhancer sites may cause differential gene expression in disease samples , the study of such phenomena would better be conducted using data with whole-genome coverage.