Public datasets
In this study, we used a total of 4174 genome wide datasets (Additional file 1: Table S1), including 65 WGBS profiles, 449 Infinium 450 K arrays, 3660 RNA-seq data from Roadmap Epigenomics [46], ENCODE [47], and the TCGA consortium [48], respectively. The TFs were downloaded from Human TF repertoire [49]. The homeobox genes were downloaded from HomeoDB2 database [29]. A total of 426 cancer dominant genes (oncogenes) and 128 cancer recessive genes (tumor suppressor gene) were defined by the COSMIC [31] database. The pan-cancer oncogenic signatures including CNV deletion (116) and amplification (151), somatic mutation (199), and DNA methylation (13) were identified by a hierarchical classification method of 3299 TCGA tumors from 12 cancer types [45]. Also, a human genome-wide enhancer cluster was obtained from the ChIP-seq datasets of H3K27ac peaks in 88 human cell types [25]. Gene expression values of normalized read counts by expectation-maximization (RSEM) from RNA-seq data of primary tumor and normal samples were obtained from the TCGA data portal (https://tcga-data.nci.nih.gov/docs/publications/tcga/) including 19 bladder normal and 408 urothelial carcinomas (BLCA), 113 breast normal and 1102 invasive carcinomas (BRCA), 59 lung normal and 515 adenocarcinomas (LUAD), 102 lung squamous normal and 502 carcinomas (LUSC), 35 stomach normal and 415 adenocarcinoma (STAD), 22 uterine normal and 437 corpus endometrial carcinoma (UCEC). Promoters were defined from 1 kb upstream to 500 bp downstream of RefSeq transcription start sites (TSS) and gene-bodies were defined from 500 bp downstream of RefSeq TSS to RefSeq transcription termination sites (TTS).
Identification of reference under-methylated regions
We developed a comprehensive statistical framework to identify human reference UMRs from 65 high-quality WGBS profiles (genome-wide CpG coverage percentage > 90%; Additional file 1: Table S1), comprising 30 normal tissues and 35 primary solid tumors:
Step 1: For each WGBS profile, we used BSMAP [50] to trim adaptor, low-quality, and duplicated sequence with default threshold, aligned bisulfite-treated reads to the human genome (hg19). We used the coverage threshold of 4 reads to ensure the accuracy of CpG methylation detection [51]. The methylation ratio of each CpG covered with at least 4 reads was calculated by the module bsratio in BSMAP.
Step 2: The UMRs were identified that include at least four consecutive hypomethylated CpGs with the mean methylation ratio < 10% as described previously [20]. To reduce the effect of sparse CpG density in our UMR detection based on HMM model, we removed UMRs with Obs/Exp value of CpGs < 0.1.
Step 3: A total 3,521,985 redundant UMRs from multiple tissue and tumor WGBS profiles were reduced to 369,852 non-redundant ones through merging the overlapping UMRs among multiple samples. To describe genome-wide UMR enrichment distribution across tissue and tumor samples, the UMR frequency (UOF) of the intersect segment (s) among N samples was defined as\( UOF(s)=\frac{\sum_{i=1}^N{s}_i}{N}, \)
$$ where\ {s}_i=\left\{\begin{array}{c}1,\kern0.5em if\kern0.5em the\ ith\ sample\ UMR\ covering\ segment\ s\\ {}0,\kern0.5em otherwise\kern17.25em \end{array}\right.. $$
UMR occupancy scores represent the UMR co-occupancy level of population-scale samples of normal tissues and tumors. The higher the UOF, the more conserved are the UMR in the population-scale samples. Conversely, a UOF decrease represents the UMR shortening or loss, suggesting that hypermethylation occurs in these regions at a population scale.
Step 4: Inspired by ChIP-seq peak calling for detection of significantly enriched regions, we detected reference UMRs from UOF profile within population of samples based on a Poisson test (p values < 1.0e-8), p values adjusted by the Benjamini and Hochberg (BH) method. These reference UMRs were identified for normal tissue (32,864) and tumors (45,081), respectively. A total of 46,384 recurrent UMRs were identified through combining the normal and tumor reference UMRs (Additional file 2: Table S2).
Identification of pan-cancer differentially methylated UMRs
We sought to uncover common patterns of aberrant DNA methylation across diverse tumor types with low heterogeneity among normal tissues. A statistical framework was devised to identify pan-cancer differentially methylated UMRs.
Step 1: Normal tissue-specific UMRs were removed using a quantitative method QDMR [26] based on Shannon entropy with default threshold. The lower the entropy value, the bigger the difference of DNA methylation across sample. In total, 24,098 reference UMRs with low heterogeneity across normal samples were retained.
Step 2: The differential methylation (DM) analysis was performed by employing a likelihood ratio test method to dissect aberrant methylation between tumor and normal samples. The mean methylation level of the ith UMR in the jth normal sample is denoted as \( {x}_{ij}^0 \), while the methylation level in the kth tumor sample is represented as \( {x}_{ik}^1. \) Here \( {x}_{ij}^0\sim Beta\left({\alpha}_i^0,{\beta}_i^0\right),\kern0.5em i\in \left[1,2,..,N\right],\kern0.5em j\in \left[1,2,..,{M}^0\right] \), N is the total number of UMRs and M0 is the number of the normal samples. In addition, \( {x}_{ik}^1\sim Beta\left({\alpha}_i^1,{\beta}_i^1\right),k\in \left[1,2,..,{M}^1\right]\kern0.5em \)and M1 is the number of the tumor samples. Then, the goal of testing if the ith UMR is differential across tumor and normal samples is to determine if they have the same distribution parameters. This is equivalent to test the following hypothesis
$$ {H}_o:{\alpha}_i^0={\alpha}_i^1={\alpha}^s\ and\ {\beta}_i^0={\beta}_i^1={\beta}^s\kern0.5em vs $$
$$ {H}_1:{\alpha}_i^0\ne {\alpha}_i^1\ or\ {\beta}_i^0\ne {\beta}_i^1 $$
To this end, a likelihood ratio test for is adopted, whose test statistics are expressed as:
$$ {D}_i=-2\mathit{\ln}\prod \limits_{j=1}^{M^0+{M}^1}P\left({x}_{ij}|{\alpha}^s,{\beta}^s\right)+\mathit{\ln}\prod \limits_{k=1}^{M^0}P\left({x}_{ik}|{\alpha}^0,{\beta}^0\right)+\mathit{\ln}\prod \limits_{l=1}^{M^1}P\left({x}_{il}|{\alpha}^1,{\beta}^1\right) $$
Here, Di approximately follows a χ2 distribution with degree of freedom df2 − df1 under Ho, from which the p value can be computed as
$$ {p}_{value}=1-{\chi}^2\left({D}_i, df2- df1\right) $$
where df2 and df1 represent the degrees of freedom for the model under H1 and Ho, which are 4 and 2, respectively. In the end, the p values for all the UMRs are adjusted to false discovery rate (FDR) using the BH method. The absolute DM values of UMRs were defined as the difference of mean methylation levels between tumor and normal samples. Both p value adjusted by BH method < 0.001 and DM value > 0.1 were used to identify the differentially methylated UMRs relative to all normal samples (Additional file 3: Table S3). To compare these differentially methylated UMRs with conserved UMRs across tumor types, we established two control groups: (1) 1398 conserved canyons; and (2) 9596 conserved cUMRs, which are not differentially methylated in all of the seven tumor types.
Gene expression analyses
Differentially expressed m-homeobox genes were identified using the software edgR [52] with FDR-adjusted P value < 0.01 and relative fold changes of mean expression level > 2 (tumor vs norm).
HumanMethylation450 BeadChip analysis
We selected a large cohort of Infinium Human Methylation 450 K BeadChip data for Uterine Corpus Endometrial Carcinoma (UCEC), including 22 normal and 427 primary tumor samples from TCGA (Additional file 1: Table S1). The probes with one or more single nucleotide polymorphisms (SNPs) were removed and the ComBat normalization was used to reduce the batch effect. DNA methylation levels of 482,421 CpG sites were measured as β values in the range of 0–1 that cover about 1.7% of total CpGs in the human genome. 450 K BeadChip probes are enriched in the pan-cancer hypermethylated canyons, > 90% of which include at least 10 CpGs (Additional file 4: Figure S7a). The mean beta value of 450 K BeadChip probes exhibited an almost perfect accordance (Pearson correlation coefficient ~ 0.90) with the mean methylation level using WGBS (Additional file 4: Figure S7b). Thus, the 450 K BeadChip can be used to reliably measure the methylation level of pan-cancer hypermethylated canyons.
Correlation between gene expression and locus-specific DNA methylation
Each gene (normalized into 5 kb) and the flanking 1-kb regions were split into 70 bins with a 100-bp window. Spearman’s rank correlation was computed in each bin. For a single sample, Spearman’s rank correlation coefficient was computed between gene expression and the DNA methylation level of the selected gene set at each bin. For pairwise samples (tumor vs normal), Spearman’s rank correlation coefficient was computed between gene expression changes (fold change of RSEM) and the DNA methylation level changes (absolute difference) of the selected gene set at each bin.
Gene enrichment analyses
We used DAVID [53] version 6.8 for the gene ontology analysis of pan-cancer hypermethylated canyon/cUMRs and we only plotted the GO terms with p values < 1.0e-10 with Benjamini correction. Gene enrichment significant levels for homeobox genes, tumor suppressors and oncogenes were calculated by Fisher’s exact test.
Vector construction
In order to control expression of dCas9-SunTag and scFv-DNMT3A, we acquired doxycycline-inducible open-reading frame expressing vector Pinducer 20 (P20) (Addgene 44,012) from the Thomas F. Westbrook lab and we further exchanged the selection marker of the original P20 vector from neomycin to blasticidin (P20-BSD). The sequence of dCas9-SunTag-2A-BFP and scFv-sfGFP-DNMT3A was then gateway cloned to P20 and P20-BSD, respectively. Catalytic inactive mutation (E756A) of DNMT3A was generated using agilent QuickChange II XL kit based on manufacturer’s instructions in PDONR223-scFv-sgGFP-DNMT3A and then gateway cloned to the P20-BSD vector.
DNA methylation editing using the dCas9-SunTag-DNMT3A system
Locus-specific DNA methylation of DLX1 and POU3F3 gene-bodies was conducted using our dCas9-SunTag-DNMT3A system [36]. In brief, doxycycline-inducible lentiviral particles of dCas9-SunTag-p2A-BFP and scFv-sfGFP-DNMT3A were transduced in a human embryonic kidney cell line (HEK293T). The single clones of idCas9SunTag, idCas9SunTag + iscFvDNMT3A, and idCas9SunTag + iscFvDNMT3AE756A were purified. Lentiviral particles of sgDLX1-puromycin and sgPOU3F3-puromycin were also generated and transduced in previously generated inducible dCas9-SunTag-DNMT3A cells. Transduced cells were treated with 2 μg/mL puromycin for seven consecutive days and cultured in 2 μg/mL doxycycline for another 30 days. SgRNA primers were listed as follows: DLX1-F 5’-CACCGGGCGGACTCGGAGAAGAGCA-3′, DLX1-R 5’-AAACTGCTCTTCTC CGAGTCCGCCC-3′, POU3F3-F: 5’-CACCGCGGCGGCGGGGGCGGCGCAG.
-3′, POU3F3-R: 5’-AAACCTGCGCCGCCCCCGCCGCCGC-3′.
DNA methylation analysis of targeted regions
Genomic DNA of dCas9-SunTag-DNMT3A-treated cells was extracted by Purelink Mini Kit (Invitrogen) and bisulfite converted by Epitect Bisulfite Kit (Qiagen). Promoter (chr2:172,950,395-172,950,785) and gene-body (chr2:172,951,180-172,951,400) regions of DLX1, and promoter (chr2:105,470,350-10,470,850) and gene-body (105,471,850-105,472,350) regions of POU3F3 were amplified from bisulfite-treated DNA by PCR using the following program. First, samples were heat activated at 95 °C for 5 min, then kept at 95 °C for 30 s, then at 60 °C for 2 min and 30 s and decreased by 0.2 °C every cycle, at 72 °C for 2 min and 30 s and repeated from second step for 40 cycles. Finally, the samples were elongated at 72 °C for 10 min. Bisulfite PCR primers used for promoter (P) and gene-body (E) of DLX1: DLX1-P-F 5′- GGGAAGTAGAGGAGAGAAAGTTTTA -3′, DLX1-P-R 5′- CTCTCCTCTCTTCTCTTTCTCTC -3′, DLX1-E-F: 5′- ATTTTTTTT GTAAAGGTAGGAGTTGAG -3′, DLX1-E-R 5′- AACACATACACACAATAACA CCC -3′. Bisulfite PCR products were run in 2% agarose gel electrophoresis, excised, and extracted using a gel extraction kit (Qiagen). DNA concentration of gel-extracted products was measured using qubit dsDNA HS assay kit (Life Technologies) and adjusted to 0.2 ng/μL for Nextera libraries preparation. Nextera libraries preparation was based on the manufacturer’s instructions (Illumina). We used the software of BSMAP [50] to align the paired-end reads to the human genome (hg19) and low-quality sequences were trimmed as the default threshold. High average coverage of each sample was obtained (> 2000×) and their methylation ratios of CpGs with coverage depth > 1000× were computed using the bs-ratio module in software BSMAP.
Quantitative PCR
Complementary DNA was reverse-transcripted from 1 μg RNA following the manufacturer’s instructions. Primer for qPCR: 18S-F: GTAACCCGTTGAACCCCATT, 18S-R: CCATCCAATCGGTAGTAGCG, qPCR-DLX1-F: ATGCACTGTTTACACTCGGC, qPCR-DLX1-R: GACTGCACCGAACTGATGTAG. qPCR-POU3F3-F: GCGGCTTCTAACCCCTACC, qPCR-POU3F3-R: CCCCTGCATGAAGTCGCTC. qPCR cycle conditions: 3 min at 95 °C; 40 cycles of 10 s for 95 °C,10 s for 55 °C, and 30 s for 72 °C .