Screen for systemic interindividual variation in DNA methylation
We performed deep WGBS on all 30 samples, generating ~ 1.2B 150-bp paired-end reads (~ 122 Gb uniquely mapped sequence per sample), yielding average genome-wide read depth of ~ 40× (Additional file 1: Table S1). Genetic identity of the three libraries representing each individual was verified by calling single nucleotide variant (SNV) genotypes from the WGBS data (Additional file 4: Figure S1). We analyzed CpG methylation at 100-bp resolution, focusing on all 100-bp human autosomal bins containing at least one CpG site and yielding adequate read depth (see Additional file 3 for methods). Considering all (~ 13M) such informative bins, genomic CpG methylation in the 30 samples clustered by tissue (Fig. 1b). Since combined effects across multiple neighboring CpGs are both more stable and more likely to affect gene expression [10], our goal was to identify regions of SIV. To maximize genomic coverage, we applied a two-stage approach. First, all reads from each individual were combined to calculate individual-level average methylation for each bin, and individually correlated regions of methylation (bin-bin R ≥ 0.71 (R2 ≥ 0.5)) were built in a step-wise fashion [11, 12] (see “Methods” and Additional file 3: Supplementary Methods). Then, for each such region, inter-tissue correlation (ITC) of average methylation was assessed across all tissue-type pairs. By this approach, 90% of all human autosomal bins were informative. (Sex chromosomes were not evaluated due to the limited sample size.) Regions yielding a minimum ITC ≥ 0.71 were identified as “correlated regions of SIV” (CoRSIVs) (Fig. 1c, d).
Permutation testing showed that only 39,424 (9%) of these 446,665 regions satisfy the minimum ITC criterion by chance; hence, the vast majority of the 39,424 CoRSIVs (Additional file 2: Table S2) are statistically significant. Many, however, include only a few CpGs, and some showed relatively minor interindividual variation (Fig. 1e). To target the most robust epigenetic variants likely to have the greatest biological relevance, we focused all subsequent analyses (and the remainder of this report) on the 9926 CoRSIVs containing at least 5 CpGs (Additional file 4: Figure S2a, b) and exhibiting an inter-individual methylation range of at least 20% [6, 13] (Fig. 1e; annotated list in Additional file 1: Table S3). Importantly, analysis of our permutation results showed that each of these is statistically significant (P < 0.05) (Additional file 4: Figure S2c). Unlike at genome-wide bins (Fig. 1b), methylation at these 9926 CoRSIVs clustered by individual (Fig. 1f). Notably, although our cutoff for minimum ITC was 0.71, most of the 9926 CoRSIVs exhibit minimum inter-tissue R > 0.85 (Additional file 4: Figure S2d). Analysis of DNA methylation by bisulfite pyrosequencing in an independent set of Vietnamese cadaver tissues (liver, kidney, and brain from 17 individuals) validated systemic interindividual variation at 9 of 11 loci examined (Additional file 4: Figure S3, Additional file 1: Table S4). The two that did not validate showed low interindividual variation in the Vietnamese, which might arise from chance similarities in the small sample size studied. Previous studies have estimated that 15–20% of CpG sites in the human genome show tissue-specific interindividual variation [4, 14]. CoRSIVs are exceedingly rare by comparison, comprising only ~ 0.1% of the human genome.
Since thyroid, heart, and brain do contain some cell types in common (white blood cells, for example) one might suppose that CoRSIV signals could arise from interindividual variation in the proportional representation of these contaminating cell types. Given the very small proportion of blood cells in the three tissues, it is difficult to imagine that they could explain inter-tissue R2 ≥ 0.50. Nonetheless, we devised several tests of this alternative explanation. First, analyzing publicly available WGBS data on monocytes isolated from six individuals, we show interindividual variation at CoRSIV but not control regions (Additional file 4: Figure S4), demonstrating that the interindividual variation is evident even in a single highly purified cell type. Second, using samples collected as part of a study of infant twins, we performed bisulfite pyrosequencing at several CoRSIVs in 20–30 individuals and found that interindividual variation in fingernail DNA is correlated with that in peripheral blood (Additional file 4: Figure S5). Since fingernails are composed of keratinized nail matrix cells (and contain no leukocytes) these correlations clearly are not the result of blood contamination. Lastly, we used data from the Blueprint epigenome project [15] to assess the overlap of CoRSIVs with differentially methylated regions (DMRs) distinguishing two major leukocyte subtypes: B cells and neutrophils. If CoRSIV signals result from individual variation in the proportionality of contaminating leukocytes, CoRSIVs should overlap extensively with these DMRs. However, only 37 such overlaps were found (i.e., < 1% of the 9926 CoRSIVs) (see Additional file 3 for methods). The single explanation consistent with all of these observations is that CoRSIVs are bona fide regions of systemic interindividual variation in DNA methylation. Indeed, CoRSIVs are highly enriched for CpG sites previously shown to be highly correlated across four brain regions and blood [16] (odds ratio 16.5, P = 1.2 × 10− 81) (see Additional file 3 for methods).
Characteristics of CoRSIVs
Systemic individual differences in global DNA methylation could be attributed to genetic variants in, for example, genes regulating one-carbon metabolism or DNA methylation enzymes [17] or to technical variation in, for example, the time between death and tissue collection (post-mortem interval). In selecting donors we attempted to minimize potential sources of variation including age, body mass index, and post-mortem interval (Additional file 1: Table S5). Average CoRSIV methylation did not differ among the 10 donors studied (Fig. 1f and Additional file 4: Figure S6a), arguing against global regulation of or systematic influences on methylation at these regions. To visualize CoRSIVs throughout the genome, we generated an atlas of annotated CoRSIV maps for each human autosome (https://corsiv.shinyapps.io/CoRSIV_Plotter/) (example region shown in Fig. 1g). A striking feature of these maps is the extensive long-range correlation (and anti-correlation) among CoRSIVs. Long-range interindividual correlation in DNA methylation was previously reported in population studies of peripheral blood methylation using a commercial methylation array [12, 18]. For example, two regions separated by ~ 20 kb and overlapping the 5′ and 3′ ends of SPATC1L, previously identified as regions of anticorrelated methylation in peripheral blood [12], are in fact anticorrelated CoRSIVs (Additional file 4: Figure S7a). Since most haplotype blocks in Caucasians are < 50 kb in length [19], we were surprised to observe many examples of positively intercorrelated CoRSIV pairs spanning much larger genomic distances, which we refer to as “superCoRSIVs” (Additional file 4: Figure S2f). Topologically associated domains (TADs) are broad genomic regions (median size ~ 1 Mb) with a high probability of physical association and are largely invariant across different cell types [20]. We therefore tested whether superCoRSIVs tend to occur within TADs; indeed, across 10 human tissues, the proportion of superCoRSIVs wholly within TADs was consistently elevated relative to that of a set of matched control regions (paired t-test P = 10− 5, Additional file 4: Figure S2f). Since TAD boundaries are associated with CTCF sites [20], our observation that CTCF binding sites are enriched within SuperCoRSIVs (χ2 test P = 0.003, Additional file 4: Figure S2 g) additionally suggests a mechanistic link between SuperCoRSIVs and TADs.
Whereas most of the 9926 CoRSIVs are only 200–300-bp long and include 5–10 CpGs, the largest span several kb and involve hundreds of CpGs (Fig. 2a). Rather than being randomly distributed throughout the genome, CoRSIVs tend to occur in clusters (Kolmogorov–Smirnov test vs. uniform distribution: P < 10− 100). The two biggest peaks of CoRSIV density are observed at the major histocompatibility (MHC) locus on chromosome 6 and the pericentromeric region on the long arm of chromosome 20 (Fig. 2b); illustrative profiles of individual methylation at one gene from each region are shown in Fig. 2c. SIV in DNA methylation was previously reported in the MHC locus [21]. We are not aware of previous publications, however, highlighting the exceptional epigenetic behavior of the chromosome 20 region. Interestingly, whereas the 92 CoRSIVs spanning the MHC region are largely independent of one another (Additional file 4: Figure S8a,b), the 56 CoRSIVs comprising the chromosome 20 region are highly intercorrelated (Additional file 4: Figure S8c,d), indicating that individual methylation status is correlated across this entire ~ 2.3-Mb region. As suggested by the circos plot (Fig. 2b) and the higher-resolution plots in Additional file 4: Figure S9, CoRSIVs are > 2-fold enriched in subtelomeric regions (χ2 test P < 10− 300, Fig. 2d, Additional file 1: Table S6). Given that subtelomeric regions are highly variable, one might suppose that the CoRSIV enrichments in these regions might be due to poor mapping rates leading to artifacts. However, unique mapping rates in the chromosome 20 pericentromeric region were generally above, and those in subtelomeric regions only slightly below, the genomewide average (Additional file 4: Figure S10). Further, only 14 of the 9926 CoRSIVs overlapped an ENCODE blacklist region [22] (genomic regions known to yield artifacts in functional genomics studies) (see Additional file 1: Table S19).
To explore sequence features associated with CoRSIVs, we generated two reference sets of genomic regions: a randomly selected set of genomic regions matched to the CoRSIVs on chromosome, size, and CpG content (“controls”) (Additional file 4: Figure S11a, Additional file 1: Table S7), and a set of similarly matched loci drawn from regions of tissue-specific differential methylation (tDMRs) (Additional file 4: Figure S11b,c, Additional file 1: Table S8). Compared to both reference sets, CoRSIVs are enriched for transposable elements and depleted for CpG islands (CGI) and transcription factor binding sites (χ2 test P < 10− 8 for all comparisons; Fig. 2e, Additional file 4: Figure S2 h, Additional file 1: Table S9). Relative to either controls or tDMRs, CoRSIVs are under-represented within and near genes and enriched in intergenic regions (χ2 test P < 10− 16; Fig. 2f, Additional file 1: Table S10). Analysis of ChromHMM features across 111 reference human epigenomes (derived from the NIH Epigenome Roadmap project) [23] shows that, relative to control regions, CoRSIVs are enriched in quiescent regions and those associated with repressive polycomb marks and depleted in heterochromatic regions, active promoters, and enhancers (Fig. 2g); the strongest depletions were found for promoter and enhancer regions characterized as “bivalent” (i.e., poised between active and inactive states) in embryonic stem cells [24]. Similar depletions/enrichments are observed when CoRSIVs are compared to tDMRs (Additional file 4: Figure S2e, Additional file 1: Tables S11, S12). Analysis of data on genomic evolutionary rate profiling scores [25] showed that, relative to both control and tDMR regions, CoRSIVs tend not to occur in highly conserved genomic regions (CoRSIVs vs. controls odds ratio = 0.4, P = 8.2 × 10− 118; CoRSIVs vs. tDMRs odds ratio = 0.2, P < 1.0 × 10− 200) (Additional file 4: Figure S2i).
Influence of genetic variation on CoRSIVs
We took several complementary approaches to evaluate the extent to which DNA methylation at CoRSIVs is genetically determined. Associations between genetic variation and DNA methylation can be assessed relative to single genetic variants (methylation quantitative trait loci—mQTL) [26, 27] or at the haplotype level (haplotype-dependent allele-specific methylation) [28]. As previously reported [29], the PM20D1 promoter region illustrated in Fig. 1d exhibited strong mQTL. With methylation data on only 10 individuals, however, we were underpowered to analyze mQTL at all 9926 CoRSIVs, so performed focused analyses on several regions of high CoRSIV density (Fig. 2b). We used donor-specific SNV data from GTEx [9, 30] to test for associations between individual-average CoRSIV methylation and SNV genotype. Since the strongest mQTL effects occur over fairly short distances [28], we assessed associations with common variants within 10 kb of each CoRSIV and, to minimize false positives, considered only SNVs for which all three genotypes were represented. Of 96 informative CoRSIVs in the high-density region overlapping the MHC locus on chromosome 6 (Fig. 2b, Additional file 4: Figure S8a), 74 showed significant mQTL (Fig. 3a, left), corroborating a previous report that the MHC locus is an mQTL hotspot [12]. Conversely, in the high-density CoRSIV region near the chromosome 20 centromere (Fig. 2b, Additional file 4: Figure S8c), none of 10 informative CoRSIVs showed mQTL (Fig. 3a, right). Next, we considered all pairwise comparisons among the 10 donors to test associations between interindividual differences in methylation and average SNV genotype in the 10-kb regions flanking each CoRSIV. This analysis (Fig. 3b) indicated that ~ 60% of interindividual variation in CoRSIV methylation is explained by genetic variation in cis. We also exploited several published mQTL data sets based on the Illumina HM450 array [26, 27, 31,32,33]. Of the ~ 485,000 CpG-specific probes included on the HM450 array, just 1659 overlap 819 CoRSIVs (Additional file 1: Table S13). Across eight populations studied by five different groups, approximately half of these CoRSIVs consistently showed mQTL, while ~ 40% showed no evidence of mQTL (Fig. 3c, Additional file 1: Table S14). The methylation distribution in each of the 10 GTEx donors is substantially different for mQTL-negative vs. mQTL-positive CpGs (Additional file 4: Figure S6b), providing independent corroboration of the two classes evident in the HM450 mQTL analysis.
We next used the DECIPHER database [34] to test whether CoRSIVs might be explained by copy number variations (CNVs). Unsurprisingly (given their association with subtelomeric regions), CoRSIVs were significantly enriched for overlaps with CNV regions compared to control (odds ratio = 1.8, P = 1.7 × 10− 94) and tDMR regions (odds ratio = 2.7, P = 8.1 × 10− 100). To determine if SIV among the 10 GTEx donors is associated with deletions or duplications, we calculated Spearman correlations between average WGBS read depth and average CoRSIV methylation across the 10 individuals. The P value distribution (Fig. 3d) shows that, for the vast majority of CoRSIVs, individual-average methylation is not significantly associated with read depth, indicating that CNV generally does not explain SIV at CoRSIVs.
Previous population-based studies have examined haplotype effects on regional methylation by comparing the decay rate of correlation in methylation with that of genetic linkage [11, 12]. We likewise analyzed methylation data at CoRSIVs in the context of linkage disequilibrium (LD) data on Caucasian individuals (CEU) in the 1000 genomes project [35]. Relative to control regions, CoRSIVs were associated with regions of higher LD, and methylation in CoRSIVs was more highly correlated with that in flanking regions (Fig. 3e). Whereas previous studies on methylation variants in Arabidopsis [11] and humans [12] found that methylation correlation decays very rapidly (within 1–2 kb), methylation decay at CoRSIVs is much longer, not returning to baseline until ~ 20 kb (Fig. 3e). To examine regional associations between genetic and methylation variation, we therefore assessed the linear correlation between methylation decay and LD decay within 20 kb upstream and downstream of each CoRSIV. The resulting P value distribution (Fig. 3f, Additional file 1: Table S15) is highly enriched for significant values, providing evidence of genetic influence on methylation. Examples drawn from opposite ends of the P value distribution illustrate that this approach distinguishes CoRSIVs located entirely within haplotype blocks—and putatively under genetic control—(Fig. 3g) from those in which strong regional correlation in methylation occurs in the absence of underlying haplotype structure (Fig. 3h), indicating “pure” epigenetic variation [11].
Together, these analyses show that methylation at many CoRSIVs is influenced by genetic variation in cis but also suggest that, at some of these loci, systemic interindividual epigenetic variation may be independent of genetic variation. With the data currently available, we cannot yet make definitive conclusions about the absence of genetic effects. Characterization of all possible genetic influences on CoRSIV methylation (including trans effects, which were not evaluated here) will require methylation analyses targeted to these regions in a large number of genotyped individuals. Systemic interindividual epigenetic variation that is largely independent of genetic variation is consistent with metastable epialleles [36]. In mouse and human studies, epigenetically metastable loci are characterized by stochastic interindividual variation in epigenotype that is established in the early embryo [5] and influenced by periconceptional environment [7, 8, 37, 38].
Influence of periconceptional environment on CoRSIVs
To assess potential epigenetic metastability at these loci, we examined data from a natural experiment in a well-characterized population of subsistence farmers in rural Gambia in which annual climatic variation causes dramatic seasonal variation in maternal nutritional status and energy balance. Previous studies in this population show that, at candidate and bona fide metastable epialleles, children conceived during the peak of the rainy season show persistent and systemic elevations in DNA methylation compared to children conceived during the dry season, and these correlate with the mothers’ methyl donor metabolomes assessed in early gestation [37]. Here, rather than focus on pre-selected epochs, we studied 233 children conceived throughout the calendar year [13] and performed Fourier regression analysis to determine seasonal effects agnostically. DNA methylation in peripheral blood was measured by Illumina HM450 arrays, so this analysis is limited to CoRSIVs and control regions that are informative on this platform. Significant seasonal variation was highly enriched at CoRSIVs (Fisher’s exact test P = 1.7 × 10− 6) but not at controls or tDMRs (Fisher’s exact test P = 1.0 and 0.85, respectively, Additional file 4: Figure S12a,b). Moreover, this unsupervised analysis independently validated seasonal effects corresponding to peaks of the rainy (July–September) and dry (January–April) seasons (Fig. 4a) consistent with previous studies of human metastable epialleles [6, 8, 13, 37]. It should be noted that in these outbred human populations we have yet to rule out potential confounding by genetic variants that influence both season-specific fecundity and systemic methylation. This caveat notwithstanding, these data provide independent evidence that a significant proportion of CoRSIVs exhibit epigenetic metastability, which can be influenced by maternal nutrient status during early pregnancy.
Associations of CoRSIVs with gene expression and human disease
CoRSIVs show great promise for epigenetic epidemiology [39] because, at these loci, DNA methylation measurements in any easily biopsied tissue should provide information about epigenetic regulation throughout the body. To test this, we analyzed data from a previous large population study of adult women in which DNA methylation in adipose tissue was measured by the HM450 array and gene expression was assessed in adipose tissue, skin, and lymphoblastoid cell lines [26]. Unlike at control regions and tDMRs (Additional file 4: Figure S13b), among gene-associated CoRSIVs that showed an association between methylation and expression in adipose tissue, most also showed an association between methylation in adipose tissue and gene expression in skin or lymphoblastoid cell lines (Fig. 4b,c, Additional file 4: Figure S7b, Additional file 1: Table S16). These data substantiate the practical utility of CoRSIVs, in that non-invasive measurements of DNA methylation (for example, in peripheral blood) can provide an indication of epigenetic regulation in other tissues. To explore this further, we analyzed a recently published database encompassing 1319 “epigenome-wide association studies” (EWASdb) [40] (based on DNA methylation, many in peripheral blood). As expected (based on the systemic nature of CoRSIVs), this analysis showed that disease-associated CpG sites are 37% enriched in CoRSIVs relative to control regions (χ2 test P = 2.2 × 10− 4) and 53% enriched in CoRSIVs relative to tDMRs (χ2 test P = 6.5 × 10− 8) (Additional file 4: Figure S12e). To explore these associations in greater depth, we began by dividing the disease outcomes into cancer and non-cancer diseases, since the HM450 array is enriched for CpG sites known to be hypermethylated in cancer, and many epigenetic studies of cancer are based on comparing tumor vs. normal tissue, which is not relevant to the systemic nature of CoRSIVs. Whereas CoRSIV probes were not enriched for disease associations with cancer (Additional file 4: Figure S14), they did show strong associations to non-cancer diseases (Additional file 4: Figure S15). The greatest enrichments were found for “deletion or duplication of chromosome 7q11.23,” “immunodeficiency, centromeric instability and facial abnormalities (ICF) syndrome,” “asthma,” “down syndrome,” and “human immunodeficiency virus.” Notably, though several of these relate to immune function, none of the probes driving these enrichments were within the CoRSIV-rich MHC region on chromosome 6.
Although CoRSIVs are depleted in and around genes (Fig. 2f), there are thousands of CoRSIV-associated genes. As another way to explore potential phenotypic consequences of epigenetic variation at CoRSIVs, we used GREAT [41] to perform gene ontology analysis (Additional file 4: Figure S16). The main significant enrichments (vs. genomic background) relate to the high density of CoRSIVs identified in the MHC locus. Otherwise, CoRSIVs were not associated with specific cellular processes or components. We used an automated literature search tool (PubTator) [42] to search PubMed and identify diseases linked to the 3127 CoRSIV-associated genes. This analysis initially found that, compared to all genes listed in PubMed, CoRSIV-associated genes are enriched for associations with hundreds of diseases, particularly those involving the brain (Additional file 1: Tables S17, S18, Additional file 4: Figure S12c). Recent studies have shown that genes expressed in the brain (particularly in neurons) tend to be longer than average [43], so we tested whether our analysis based on gene-body overlap could bias toward longer genes. Indeed, relative to all PubTator genes, CoRSIV-associated genes are enriched for long genes (Additional file 4: Figure S12d). After adjusting for gene length, CoRSIV-associated genes were no longer associated with diseases involving the brain (Additional file 1: Table S17). Nonetheless, thousands of CoRSIV-associated genes are implicated in a wide range of human diseases including neoplasms, mental disorders, digestive, nervous system, and cardiovascular diseases (Fig. 4d, Additional file 1: Table S18). Recent reports highlight several examples, including SPATC1L in male infertility [44], PM20D1 in Alzheimer’s disease [29], and DUSP22 in lymphoma [45].