Gene set enrichment analysis for genome-wide DNA methylation data

DNA methylation is one of the most commonly studied epigenetic marks, due to its role in disease and development. Illumina methylation arrays have been extensively used to measure methylation across the human genome. Methylation array analysis has primarily focused on preprocessing, normalization, and identification of differentially methylated CpGs and regions. GOmeth and GOregion are new methods for performing unbiased gene set testing following differential methylation analysis. Benchmarking analyses demonstrate GOmeth outperforms other approaches, and GOregion is the first method for gene set testing of differentially methylated regions. Both methods are publicly available in the missMethyl Bioconductor R package. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-021-02388-x.

No. genes mapped to a CpG B
2,300 genes are down-regulated and ~2500 are up-regulated in B-cells, compared to NK cells. (E) Log2 intensity distributions of the raw data for all B-cell development samples. (F) Multidimensional scaling plot of the normalised and filtered B-cell development Affymetrix array data. (G) Numbers of differentially expressed genes with an adjusted p-value < 0.01, for Stage 1 versus Stage 2 of B-cell development. The blue bar is the number of significantly down-regulated genes, and the red bar is the number that are significantly up-regulated; e.g. ~400 genes are down-regulated and ~500 are up-regulated in Stage 1, compared to Stage 2.

cells vs. CD8 T-cells (GO). (A)
Cumulative number of GO terms, as ranked by various methods, that are present in each truth set for the CD4 T-cells vs. CD8 T-cells comparison. ISP Terms = immune-system process child terms truth set; RNAseq Terms = top 100 terms from RNAseq analysis of the same cell types. (B) Bubble plots of the top 10 GO terms as ranked by various gene set testing methods for the CD4 T-cells vs. CD8 T-cells comparison. The size of the bubble indicates the relative number of genes in the set. The colour of the bubble indicates whether the term is present in either RNAseq (purple) or ISP (green) truth sets, both (red) or neither (blue). ebGSEA (KPMT) = ebGSEA using Known Population Median Test; ebGSEA (WT) = ebGSEA using Wilcoxon Test; GOmeth = GOmeth using either FDR < 0.05 (for contrasts with <5000 significant CpGs) or top 5000 most significant probes; HGT = hypergeometric test; mGLM = methylglm; mRRA (GSEA) = methylRRA using gene set enrichment analysis; mRRA (ORA) = methylRRA using over-representation analysis. (A) Cumulative number of GO terms, as ranked by various methods, that are present in each truth set for the monocytes vs. neutrophils comparison. ISP Terms = immune-system process child terms truth set; RNAseq Terms = top 100 terms from RNAseq analysis of the same cell types. (B) Bubble plots of the top 10 GO terms as ranked by various gene set testing methods for the monocytes vs. neutrophils comparison. The size of the bubble indicates the relative number of genes in the set. The colour of the bubble indicates whether the term is present in either RNAseq (purple) or ISP (green) truth sets, both (red) or neither (blue). ebGSEA (KPMT) = ebGSEA using Known Population Median Test; ebGSEA (WT) = ebGSEA using Wilcoxon Test; GOmeth = GOmeth using either FDR < 0.05 (for contrasts with <5000 significant CpGs) or top 5000 most significant probes; HGT = hypergeometric test; mGLM = methylglm; mRRA (GSEA) = methylRRA using gene set enrichment analysis; mRRA (ORA) = methylRRA using over-representation analysis.

Evaluation of the performance of GOregion for CD4 T-cells vs. CD8 T-cells. (A)
Bias plot showing that genes that have more CpGs measuring methylation are more likely to have a differentially methylated region. This plot is produced from EPIC array sorted blood cell type data, comparing CD4 T-cells vs. CD8 T-cells. (B) Cumulative number of GO terms, as ranked by GOregion and a simple hypergeometric test (HGT), that are present in each truth set for the CD4 T-cells vs. CD8 T-cells comparison. ISP Terms = immune-system process child terms truth set; RNAseq Terms = top 100 terms from RNAseq analysis of the same cell types. (C) Bubble plots of the top 10 GO terms as ranked by GOregion and a simple HGT for the CD4 T-cells vs. CD8 T-cells comparison. The size of the bubble indicates the relative number of genes in the set. The colour of the bubble indicates whether the term is present in either RNAseq (purple) or ISP (green) truth sets, both (red) or neither (blue). (D) Upset plot showing the characteristics of the CpGs selected as "significant" for the CD4 T-cells vs. CD8 T-cells comparison by a probe-wise differential methylation analysis using a significance cut off (FDR < 0.05), the top 5000 CpGs as ranked by the probe-wise analysis (Top 5000) or the CpGs underlying the filtered DMRcate regions (DMRcate). The probe-wise analysis with FDR < 0.05 identified over 8,000 CpGs as "significant" and had the most unique CpGs. Although DMRcate identified the fewest "significant" CpGs (~3,700), over 2,000 were unique to that approach. (E) Proportion of "significant" CpGs that are annotated to genes as identified by the three different strategies. (F) Upset plot showing the characteristics of the genes that "significant" CpGs are annotated to, as identified by the three different strategies, for the CD4 T-cells vs. CD8 T-cells comparison. CpGs identified by the probe-wise analysis with FDR < 0.05 map to over 3,000 genes. The CpGs identified by DMRcate mapped to ~600 genes, several of which are unique to this approach.

neutrophils. (A)
Bias plot showing that genes that have more CpGs measuring methylation are more likely to have a differentially methylated region. This plot is produced from EPIC array sorted blood cell type data, comparing monocytes vs. neutrophils. (B) Cumulative number of GO terms, as ranked by GOregion and a simple hypergeometric test (HGT), that are present in each truth set for the monocytes vs. neutrophils comparison. ISP Terms = immune-system process child terms truth set; RNAseq Terms = top 100 terms from RNAseq analysis of the same cell types. (C) Bubble plots of the top 10 GO terms as ranked by GOregion and a simple HGT for the monocytes vs. neutrophils comparison. The size of the bubble indicates the relative number of genes in the set. The colour of the bubble indicates whether the term is present in either RNAseq (purple) or ISP (green) truth sets, both (red) or neither (blue). (D) Upset plot showing the characteristics of the CpGs selected as "significant" for the monocytes vs. neutrophils comparison by a probe-wise differential methylation analysis using a significance cut off (FDR < 0.05), the top 5000 CpGs as ranked by the probe-wise analysis (Top 5000) or the CpGs underlying the filtered DMRcate regions (DMRcate). The probe-wise analysis with FDR < 0.05 identified over 21,000 CpGs as "significant" and had the most unique CpGs. Although DMRcate identified fewer CpGs (~7500), 3500 of them were unique to that approach. (E) Proportion of "significant" CpGs that are annotated to genes as identified by the three different strategies. (F) Upset plot showing the characteristics of the genes that "significant" CpGs are annotated to, as identified by the three different strategies, for the monocytes vs. neutrophils comparison. CpGs identified by the probe-wise analysis with FDR < 0.05 map to over 6,000 genes. The CpGs identified by DMRcate mapped to ~1200, several of which are unique to this approach.

Fig. S13. Effect of DMR filtering on gene set testing for sorted blood cell types data. (A)
Cumulative number of GO terms, as ranked by GOregion analysis of DMRs filtered using various parameters, that are present in the immune truth set for all comparisons. ISP Terms = immune-system process child terms truth set. (B) Cumulative number of GO terms, as ranked by GOregion analysis of DMRs filtered using various parameters, that are present in the RNAseq truth set for all comparisons. RNAseq Terms = top 100 terms from RNAseq analysis of the same cell types. | | = mean methylation difference across region; No. CpGs = number of CpGs ∆β underlying region. showing that genes with more measured CpGs are more likely to have a differentially methylated region (DMR). This plot is produced from 450K array B-cell development data, comparing Stage 1 vs. Stage 2. (B) Cumulative number of GO terms, as ranked by GOregion and a simple hypergeometric test (HGT), that are present in each truth set for the Stage 1 versus Stage 2 comparison. ISP Terms = immune-system process child terms truth set; Array Terms = top 100 terms from Affymetrix array analysis of the same B-cell development stages. (C) Bubble plots of the top 10 GO terms as ranked by GOregion and a simple HGT for the Stage 1 versus Stage 2 comparison. The size of the bubble indicates the relative number of genes in the set. The colour of the bubble indicates whether the term is present in either Array (purple) or ISP (green) truth sets, both (red) or neither (blue). (D) Upset plot showing the characteristics of the CpGs selected as "significant" for the Stage 1 versus Stage 2 comparison by a probe-wise differential methylation analysis using a significance cut off (FDR < 0.05), the top 5000 CpGs as ranked by the probe-wise analysis (Top 5000) or the CpGs underlying the filtered DMRcate regions (DMRcate). The DMRcate analysis identified over 11,000 CpGs as "significant" and had the most unique CpGs (>7500). (E) Proportion of "significant" CpGs that are annotated to genes as identified by the three different strategies. (F) Upset plot showing the characteristics of the genes that "significant" CpGs are annotated to, as identified by the three different strategies, for the Stage 1 versus Stage 2 comparison. Cumulative number of GO terms, as ranked by various analysis strategies, that are present in each truth set for Stage 1 vs. Stage 2. ISP Terms = immune-system process child terms truth set; Array Terms = top 100 terms from Affymetrix array analysis of Stage 1 vs. Stage 2. GOmeth (5000) = GOmeth using top 5000 most significant probes; GOmeth (FDR < 0.05) = GOmeth using all probes significant at FDR < 0.05; GOregion = GOregion using all selected region coordinates. (B) Bubble plots of the top 10 GO terms as ranked by various analysis strategies for the Stage 1 vs. Stage 2 comparison. The size of the bubble indicates the relative number of genes in the set. The colour of the bubble indicates whether the term is present in either Array (purple) or ISP (green) truth sets, both (red) or neither (blue). (C) Effect of DMR filtering on gene set testing. Cumulative number of GO terms, as ranked by GOregion analysis of DMRs filtered using various parameters, that are present in each truth set for the Stage 1 vs. Stage 2 comparison. ISP Terms = immune-system process child terms truth set. Array Terms = top 100 terms from Affymetrix array analysis of the same B-cell development stages. | | = mean methylation difference across region; No. CpGs = number of ∆β CpGs underlying region.