The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts
© Wagner et al.; licensee BioMed Central Ltd. 2014
Received: 15 July 2013
Accepted: 20 February 2014
Published: 20 February 2014
DNA methylation plays an essential role in the regulation of gene expression. While its presence near the transcription start site of a gene has been associated with reduced expression, the variation in methylation levels across individuals, its environmental or genetic causes, and its association with gene expression remain poorly understood.
We report the joint analysis of sequence variants, gene expression and DNA methylation in primary fibroblast samples derived from a set of 62 unrelated individuals. Approximately 2% of the most variable CpG sites are mappable in cis to sequence variation, usually within 5 kb. Via eQTL analysis with microarray data combined with mapping of allelic expression regions, we obtained a set of 2,770 regions mappable in cis to sequence variation. In 9.5% of these expressed regions, an associated SNP was also a methylation QTL. Methylation and gene expression are often correlated without direct discernible involvement of sequence variation, but not always in the expected direction of negative for promoter CpGs and positive for gene-body CpGs. Population-level correlation between methylation and expression is strongest in a subset of developmentally significant genes, including all four HOX clusters. The presence and sign of this correlation are best predicted using specific chromatin marks rather than position of the CpG site with respect to the gene.
Our results indicate a wide variety of relationships between gene expression, DNA methylation and sequence variation in untransformed adult human fibroblasts, with considerable involvement of chromatin features and some discernible involvement of sequence variation.
Perhaps the best studied of epigenetic phenomena, the methylation of CpG dinucleotides, has been known for many years to play a key role in X-chromosome inactivation , transcriptional silencing of foreign DNA elements  and imprinting of genes , while aberrant DNA methylation is implicated in many types of cancer . The relationship between methylation and gene expression is complex, with high levels of gene expression often associated with low promoter methylation  but elevated gene body methylation , and the causality relationships have not yet been determined. In cell populations, the levels of DNA methylation across CpG sites in the genome is typically regarded as bimodal, with CpG-rich regions known as CpG islands, often associated with transcription start sites (TSSs), typically showing hypomethylation, and other CpG sites showing hypermethylation (reviewed in ).
Methylation has been shown to be highly variable across cell types with variable sites falling in two broad categories: those with inverse correlation between DNA methylation and chromatin accessibility, and those with variable chromatin accessibility and constitutive DNA hypomethylation . As reviewed by Cedar and Bergman , DNA methylation and histone modifications share many relationships from the time of embryonic development onwards, including hypothesized roles of DNA methylation preventing the tri-methylation of histone 3 lysine 4 (H3K4me3), a marker generally associated with active promoters, as well as H3K4me3 preventing DNA methylation .
Methylation also varies between healthy individuals in a population. Relationships between DNA methylation, gene expression and various other genetic and epigenetic biomarkers have been examined previously. Recent studies have identified SNPs whose genotype correlates with DNA methylation (termed methylation quantitative trait loci, or mQTLs) in various human populations and cell types. Bell et al.  utilized the HumanMethylation27 BeadChips from Illumina to map associations between SNPs and methylation levels at 22,290 CpG dinucleotides in lymphoblastoid cell lines (LCLs), finding 180 CpG sites associated with nearby SNPs, and an enrichment for expression QTLs (eQTLS) amongst mQTLs. Gibbs et al.  used the same DNA methylation platform to study samples from four human brain regions in 150 individuals and reported hundreds of SNP-associated CpG sites in each brain tissue, with mQTLs typically located very close to the associated CpG site, and thousands of both mQTLs and eQTLs, but only modest overlaps between the two, averaging 13 CpG sites per tissue having a significant mQTL that was also an eQTL. Similar results were seen using 180 LCLs derived from one African and one European population . Zhang et al. performed similar analyses using the same methylation platform in 153 human adult cerebellum samples, finding 2,046 CpG sites with mQTLs; they reported that, in general, CpG sites located in CpG islands are more likely to be mappable to a SNP than non-CpG island sites. They also assessed the relationship between expression and methylation, with 20 of 112 CpG-gene pairs analyzed showing nominally significant correlations, with 5 of these 20 being positive correlations and the rest negative. At present, though it is known that there is a genetic component to both variable DNA methylation and gene expression, as well as genome-level differences in gene expression linked to DNA methylation, the combined relationships between the three factors remains poorly understood. Recent research  has examined the relationship between sequence, expression and DNA methylation as measured by the HumanMethylation27 assay in whole blood, finding numerous cases of methylation/expression relationships but focusing on the small number of cases in which a genetic component was also found. Drong et al. report 149 CpG sites mappable to an mQTL when making use of differential methylation hybridization covering 27,718 genomic regions in 38 unrelated individuals, finding none of the mQTLs to also be eQTLs. Gutierrez-Arcelus et al. report positive and negative expression-methylation relationships at the inter-individual level in fibroblasts, T cells and LCLs derived from a set of 204 umbilical cords from healthy newborns of European descent, with negatively correlated CpG sites enriched at ENCODE derived enhancer and promoter sites.
To further understand the relationship between genetics, gene expression, DNA methylation, and other epigenetic marks, we present analyses of DNA methylation, gene expression (both total and allelic) and DNA sequence polymorphisms from a set of 62 fibroblast cell lines derived from healthy human individuals, augmented with publicly available histone mark and DNase I hypersensitivity (DHS) data. We show that: widespread relationships exist between DNA polymorphisms and DNA methylation (mQTLs); widespread relationships exist between DNA methylation and gene expression, especially in developmentally significant genes, including all four HOX clusters; supplementing expression quantitative trait locus (eQTL) data with mapping of allelic expression to adjacent SNPs, a large set of regions and genes mapped to a QTL that also functions as an mQTL, comprising 242 genes and 23 regions not overlapping with an annotated gene; and CpG sites where methylation correlates with gene expression in cis do not, in general, show strong overlap with annotated genes or promoter regions - rather, CpG sites where this correlation is negative are most commonly seen in sites associated with active promoter marker H3K4me3 and DHS regions, while those with positive correlation are most commonly seen in the presence of the repressive chromatin marker H3K27me3 (histone 3 lysine 27 tri-methylation).
We report on the joint analysis of inter-individual variation in the levels of DNA methylation, total and allelic expression, and DNA sequence of 62 healthy parents of 31 parent-child trios of European descent. Here, we start by introducing each data set individually before discussing the relationships among them.
DNA methylation assays
DNA methylation was assayed in forearm skin fibroblast samples using the Illumina 450 K assay (Materials and methods). For each sample, methylation was measured at approximately 485,000 CpG sites, but we only considered the approximately 392,000 sites uniquely mapped in autosomes and containing no known SNPs. Methylation levels are measured in populations of diploid cells using beta values , which range from 0 (no methylation) to 1 (complete methylation of the two alleles). Methylation measurements were highly replicable, with the Pearson correlation coefficient between beta values of two replicates exceeding 0.99 in each of three pairs of biological replicates, while the average pairwise correlation coefficient between methylation from different samples levels ranges around 0.95; Additional file 1). Surrogate Variable Analysis  was used to identify possible batch effects accounting for inter-individual methylation variation but none were detected, suggesting that the observed variation may mostly be due to stochastic, environmental, or genetic effects.
We examined the levels of inter-individual variation of methylation probes, finding a drop in variation of probes located within 1,500 bp of a TSS annotated for an actively expressed gene (Figure 3B; Additional file 4B), with only 11% of probes near the TSS of a top quartile expression gene also being in the top quartile of methylation variation, compared to 30% for CpG sites adjacent to the TSS of a bottom quartile expression gene. These results were corroborated by the finding that sites with low inter-individual methylation variation were enriched for DHS and H3K4me3, and, to a lesser degree, sequence conservation (Figure 2B; Additional file 3B).
On the contrary, highly variable CpG probes (top 25%, standard deviation >0.0932) are usually located far away from the TSS (either in intergenic regions or in the gene body), or are located near the TSS of genes with low expression in fibroblasts and generally lack regulatory or evolutionary marks of function. The majority of these CpG sites show a unimodal distribution (Additional file 5). Genes whose TSS regions contain highly variable CpG probes were enriched for Gene Ontology (GO) terms related to multicellular organismal development (Additional file 6, worksheet 1), compared to the full set of genes having at least one CpG probe in the TSS region. Unexpectedly, extremely variable CpG probes (top 5%, standard deviation >0.15) show a marked increase in their overlap with DHS and H3K4me3 marks. Genes collocated with these CpG probes are even more strongly enriched for having functions related to development, and include a large number of genes from the HOX clusters (see Discussion).
Gene expression analysis
RNA expression levels for the 62 individuals were measured using the Illumina HumanRef8 microarray platform, giving expression levels for 21,916 probes mapping to a total of 16,952 genes. Only probes that showed moderate to high inter-individual expression variation (standard deviation >0.1127, corresponding to a total of 9,493 genes) were considered for further analyses.
To complement total expression data, allelic expression (AE) was assayed at a set of approximately 900,000 SNP locations dispersed in annotated genes and intergenic regions of all autosomes using hybridization to genotyping arrays, as previously described  (see Materials and methods). For each sample and each heterozygous SNP, the ratio of the expression level of each allele is estimated, after normalization to genomic DNA. Of 24,814 known canonical UCSC genes, 81% have at least one assayed SNP within their boundaries. A previously described  hidden Markov model was used to reduce the noise in the data and estimate, for each SNP of each sample, the expected true allele expression log-ratio. We note that because this approach does not make use of gene annotation, it is able to detect AE at transcripts that do not, or only partially, overlap annotated genes. However, detection power for genes that are short or contain a small number of SNPs is reduced.
As previously reported for other cell types , AE was seen to be widespread. We defined an aeSNP as a SNP whose expected log2 allele ratio is above 0.2 in at least two samples (which corresponds to 5% false discovery rate (FDR); Materials and methods), and found 74,624 aeSNPs within annotated gene regions (corresponding to 15.8% of genic/intronic SNPs), and 25,467 outside (corresponding to 5.4% of intergenic SNPs). aeSNPs were clustered into 3,327 aeRegions (consisting of two or more consecutive aeSNPs), of which more than 80% had full or partial overlap with an annotated gene (Additional file 7), similar to results previously obtained in lymphoblasts  (for a full list of aeRegions, see Additional file 8).
Linking methylation and genetic variation
Inter-individual methylation variation is likely due to both genetic and environmental variation between samples. To determine the relationship between genetic variation and CpG methylation levels, we first genotyped our 62 samples (Materials and methods). We then mapped CpG beta values to the imputed genotype at polymorphic sites within 250 kb (absolute value Spearman’s rho above 0.452, which corresponds to a P-value of 6 × 10-6 and an FDR of 5% (Materials and methods)). A set of 27,486 pairs (Additional file 9) were retained as significant, involving a total of 1,676 mappable CpG probes and 19,561 candidate mQTLs. Whole genome bisulfite sequencing-derived DNA methylation data were generated for four fibroblast cell lines (Additional file 10) and used to validate array methylation detected at mappable CpG loci. We observe high concordance between array- and sequencing-derived methylation for highly variable CpG sites across the four cell lines (254 loci; median Pearson correlation coefficient = 0.84).
Linking gene expression and genetic variation (eQTLs)
Linking gene expression to DNA methylation
We identified genes whose expression levels correlated with methylation levels of high-variance CpG probes located within their body or 250 kb on either end (absolute value Spearman’s rho >0.506, P-value <5.132 × 10-5, resulting in an FDR of 5%; see Materials and methods). This resulted in the identification of 587 genes with correlation to at least one of 1,793 CpG probes (Additional file 13). Extremely variable CpG sites are strongly over-represented amongst sites correlated with gene expression (Figure 4), and correlated CpG sites are 1.6-fold and 3.2-fold enriched, respectively, for bimodal and trimodal sites relative to the set of highly variable CpG sites.
Remarkably, methylation-correlated genes are far from representing an unbiased sample of the genome, with 78 (13%) of them being known transcription factors (GO enrichment P-value = 8.23 × 10-16) and 145 (24%) involved in multicellular organismal development (GO enrichment P-value = 6.1 × 10-22) (Additional file 6, worksheet 2). These include a number of genes from each of the four HOX clusters, together with several other key regulators of development and cellular differentiation such as EN1, HAND2, TBX1, TBX2, TBX3, TBX5, and TBX15.
In order to find genomic features that may help distinguish CpG probes that correlate positively and negatively with gene expression, we turned to DHS and histone modification data obtained by the ENCODE consortium , considering data from five human fibroblast cell lines. Though these cell lines were not derived from the same donors as used in this study, we found in general that they allowed a clear separation between the two types of CpG probes (Figure 9). CpG probes where methylation levels correlated negatively with gene expression are for the most part located in regions with marks of regulatory activity (H3K4me3 or DHS): marks that are less frequent among CpG probes that show no correlation with expression and even less frequent among those that show a positive correlation. In contrast, positively correlated probes were slightly more often seen with the inactive gene-associated marker H3K27me3 when compared with negatively correlated probes.
Overlap between mQTLs and eQTLs
Homeodomain transcription factor PAX8, transcription of which has been identified as an important biomarker in distinguishing various tumor types (reviewed in ), presented another particularly interesting case of overlap between the various types of relationships (Figure 13B), where CpG probes located near the gene’s TSS were unexpectedly positively correlated with the gene expression and those located in its body were negatively correlated. A possible explanation may involve putative uncharacterized transcript DKFZP686E10196, antisense to and located within PAX8, whose expression would be negatively correlated with the CpG methylation at sites near its TSS (but in the body of PAX8) but positively correlated probes in the body of the transcript (but near the TSS of PAX8). Indeed, RNA-seq data obtained from three individuals with differing genotypes in the cis-associated emQTLs suggest that the expression of PAX8 and its antisense transcript are positively correlated, ruling out an interference between the two but instead hinting at a possible chromatin-linked role of DKFZP686E10196 activation in regulating PAX8 transcription. (For a recent review of antisense regulation, see ).
Gene clusters of glutathione transferase families GSTM and GSTT also show multiple genes being mappable to similar sets of CpG probes and SNPs (Figure 13C,D), with active marks DHS and H3K4me3 located near negatively correlated CpG probes.
We estimated the proportion of gene expression variation that could be explained by either sequence variation alone or by a combination of sequence variation and DNA methylation, using a simple linear model and five-fold cross-validation (Materials and methods). For each gene, the five SNPs (within 250 kb) jointly explaining the largest portion of the expression variation on the training data were sequentially identified and regressed out. Independently of this we regressed out the five CpG sites explaining the largest portion of the expression variation. We found a total of 25.5% of gene expression variation to be explained by sequence variation, whereas methylation explained only 8.9% of expression variation. We applied a third model in which the top five SNPs were regressed out and then the top five CpGs were regressed from the residuals, finding in this case the variation explained by methylation dropped to 5.9%. This suggests that 5.9/8.9 = 66% of methylation-facilitated gene expression variation was independent of sequence variation. These figures are considerably higher than the 1.2% and 3.3% variation of expression explained, respectively, by DNA sequence and DNA methylation found by Li et al. in breast tumors , indicative perhaps of much greater variation of gene expression brought about by other factors in the tumor micro-environment.
We have analyzed the inter-individual variability of and relationships between one of the most comprehensive set of biomarkers in untransformed adult cells to date, including a more expansive assay for DNA methylation containing a large and diverse set of CpG dinucleotide probes; gene expression data; SNP data and allelic expression data, augmented with publicly available histone mark and DHS data from other cell cultures.
We chose primary skin fibroblasts as a model system. These comparatively easy to isolate and cultivate cells are a readily accessible source of patient material, and are in use as a model system for complex disease etiological studies (for example, Parkinson’s disease ). Epigenomes are tissue-specific, however, so the use of primary skin fibroblasts is limited to gaining insight into complex diseases of skin fibroblastic origin, or being a complementary tool with the requirement of additional studies of (the mostly more difficult to derive) primary patient tissue material. Of course, the limited sample size of this study reduced our ability to detect weak associations. However, complementing eQTL with mapping of AE significantly increases the sensitivity of our expression mapping , resulting in the discovery of many more expression/methylation QTLs than reported before.
Although most CpG sites with variable methylation seem unrelated to variation in gene expression, a non-negligible portion show significant correlations. Remarkably, the properties of these relationships appear quite complex, and the location of CpG probes with respect to the gene provides relatively little information about the sign of the correlation. Instead, chromatin states, particularly those that are representative of active chromatin and transcribed regions (DHS and H3K4me3) were more strongly indicative of negative correlation. Using the publicly available ENCODE data, we found in general that negatively correlated probes most strongly overlapped with regions of constitutive DHS but variable H3K4me3 among the five fibroblast cell lines considered, whereas positively correlated probes most strongly overlapped with an indicator of inactive transcription, H3K27me3. Work published in the ENCODE paper on DHS  indicated an inverse correlation of DNA methylation and DHS, and the authors provided evidence that DNA methylation was excluded as a consequence of open chromatin, rather than DNA methylation preventing this opening from occurring. H3K4me3 was also previously found to be inversely correlated with DNA methylation , with evidence for causality pointing in both directions. We have found further signs of intriguing links between all of these marks, and hope for experiments in the future more actively measuring these marks within the same cell lines to give better clues as to causality and to establish the constitutive and variable marks included in methylation-expression relationships.
Whether the associations between gene expression and methylation truly reflect variation in tissues or other differences acquired after sample collection is an important and challenging question. One possible source of post-sample collection variation is differences in cell proliferation rates. However, we have found that cell proliferation variation explained only 8% of the variance in methylation levels of expression-correlated CpG sites, and 13% of the variance in expression levels of methylation-correlated genes. Among mappable CpG sites and genes, the proportion of variance explained was negligible (0.7% and 5%, respectively).
Relatively high overlap was seen with results from previous studies in terms of the rare genes where both expression and methylation could be linked to genetic variation (emQTLs). In particular, C21orf56, a gene for which we find many emQTLs in fibroblasts, also exhibits the same property in whole blood  and LCLs . Several other genes having emQTLs in whole blood  (GSTM3, NAPRT1, SPG7 and WBSCR27) were also identified in our assay, indicating that genetic variation leading to both methylation and expression variation at the same locus is a relatively rare but reproducible phenomenon, the mechanism and implication of which merit further investigation. We report a total of 260 annotated genes or aeRegions that, to our knowledge, have not been previously reported as having emQTLs, including 23 aeRegions having no overlap to annotated genes. We attribute these discoveries to the usage of AE assays as well as a gene expression microarray experiments, together with use of the relatively recently developed Illumina Infinium HumanMethylation450 platform, interrogating methylation at a larger and more diverse set of CpG sites compared to most previous studies. As the effect of methylation on gene expression can in some cases involve cell-specific trans-acting factors , additional emQTLs could be found if we were to extend our analyses to additional cells or tissues. Future studies with larger sample sizes, investigating more diverse sets of cell types and utilizing platforms with even more comprehensive coverage of CpG sites can only help to uncover a greater number and potentially more subtle cases of associated DNA methylation, gene expression and DNA sequence variation.
Relationships between gene expression and DNA methylation in a population setting have not been investigated as extensively as sequence-expression or sequence-methylation relationships. However, previous high-throughput gene expression studies in fibroblasts have revealed intriguing results. In a landmark paper  assessing gene expression in skin fibroblasts derived from various anatomical sites, genes involved in a) extracellular matrix formation, b) cell signaling or fate determination, and c) cell migration signals were found to be expressed in a positional dependent fashion. Most notably of all, clustering of the samples based solely on the expression levels of 51 HOX genes recapitulated their site of origin. Koch et al. strengthened these results by also finding positional-dependent DNA methylation at HOX loci in a set of skin fibroblast samples. In the present study, fibroblast samples drawn from the same site but from different individuals show considerable DNA methylation variation in CpG sites proximal to all four HOX clusters, and a subset of HOX genes are amongst those with the closest expression-methylation ties in the genome. However, the HOX genes with correlations to methylation reported differ from those previously found to have position-dependent expression , indicating additional layers of complexity and additional factors affecting fibroblast HOX methylation/gene expression beyond position in the body. Parents from several of the trios showed similar HOX expression and methylation profiles, indicating perhaps an environmental rather than a genetic origin for these characteristic patterns. Although this was not discussed in their paper, the data reported by Gutierrez-Arcelus et al. also indicated that all four HOX clusters, as well as PAX8, showed high levels of methylation/expression correlations in each of the three cell types they studied. Future studies taking into account more carefully the environment and background of unrelated, healthy individuals will be paramount in understanding more clearly the factors at play in DNA methylation and gene expression of these fascinating loci. Overall, the inter-individual variability in gene expression seen in this fibroblast dataset, and the relationship of this variability to DNA methylation shows intriguing parallels to results seen with positional gene expression and DNA methylation variability in fibroblasts.
Genetic and methylation variation jointly explain 31% of gene expression variation in our fibroblast samples. However, the mechanisms involved appear complex and diverse, with a close interplay with other epigenetic marks. Further studies assaying inter-individual variation in histone marks and chromatin accessibility, ideally in an allele-specific manner, may bring the context necessary to the interpretation of sequence and methylation variation.
We report a comprehensive analysis of relationships between sequence variation, DNA methylation and gene expression in untransformed adult human fibroblast cells. Consistent with previous reports showing positional effects in fibroblast on HOX gene expression  and DNA methylation , we show inter-individual variation and correlation between DNA methylation and gene expression in fibroblast cells even when drawn from the same location in the body. CpG sites with positive and negative correlations to gene expression show distinctive patterns with respect to the histone marks and chromatin accessibility seen in their genomic region in other fibroblast cell lines. We find in general the most remarkable relationships found with these data to be those involving gene expression and DNA methylation in developmentally significant regions having little or no discernible involvement of DNA sequence variation.
Materials and methods
Description of cell lines and cell culture
Primary skin fibroblasts were obtained from Coriell (Camden, NJ, USA) and the McGill Cellbank (Montreal, QC, Canada). Cells were grown in alpha MEM Medium (SigmaAldrich, Oakville, ON, Canada) supplemented with 2 mmol/l L-glutamine, 100 U/ml penicillin, 100 mg/ml streptomycin, and 10% fetal bovine serum (SigmaAldrich) at 37°C with 5% CO2 to 70 to 80% confluence, then harvested and stored at -80°C until RNA and DNA were extracted.
DNA and RNA extractions
Genomic DNA (gDNA) for SNP genotyping and DNA methylation analysis was extracted from cell lysates using the GenElute DNA Miniprep Kit (SigmaAldrich) and DNeasy Blood and Tissue Kit (QIAGEN, Valencia, CA, USA), respectively, according to manufacturer’s protocol. DNA concentrations were determined using the Quant-iT PicoGreen kit (Invitrogen, Burlington, ON, Canada). Total RNA was extracted from cell lysates using the RNeasy Mini Kit (QIAGEN) according to the manufacturer’s protocol, and treated with 6 U DNase I. RNA quality was confirmed to be high for all samples on the Agilent 2100 Bio-Analyzer (Agilent Technologies, Mississauga, ON, Canada), with an RNA integrity number (RIN) range of 8.1 to 10, and concentrations were determined using the Nanodrop ND-1000 (NanoDrop Technologies, Wilmington, DE, USA).
450 K methylation array
gDNA (500 ng) was used for bisulfite conversion employing the EZ DNA Methylation Kit (Zymo Research, Irvine, CA, USA), according to the manufacturer’s protocols. The modified gDNA was processed as described in the Infinium Assay Methylation Protocol Guide Rev. C (November 2010), and analyzed on Infinium HumanMethylation450 BeadChips (Illumina; refer to  for more details), measuring DNA methylation at single CpG site resolution based on genotyping of C/U polymorphisms. We excluded probes with ≥90% sequence similarity to multiple genomic locations, probes with sequence variants in the probe-binding region and probes located on sex chromosomes, leaving 392,904 probes for further analyses. For removal of variant-containing probes HapMap (release 28, 30 CEU trios) annotated variants were imputed with 1000 Genomes project variants (pilot), and probes mapping more than one variant were removed. As a measure of methylation we chose the beta value, which theoretically ranges from 0, indicating no methylation at any allele, to 1.0 for complete methylation of both alleles.
Beta values of CpG probes were quantile normalized separately for type I and type II probes, with the reference distribution being the distribution of average per-probe beta values. Surrogate variable analysis  was carried out using the sva package in Bioconductor  and identified no hidden variables responsible for variation in data. Furthermore, following a methodology similar to that of Bell et al., residuals obtained after regressing out up to five principal components were mapped to candidate mQTLs, and in none of the cases were a larger number of mQTLs or mQTL-mappable CpG probes obtained than with simply using quantile-normalized beta values; therefore, quantile normalized beta values were used throughout for further correlation analyses.
Cell proliferation effects on expression and methylation
DNA concentrations from 8 individuals were used to obtain a set of 42 developmentally significant genes whose expression strongly (R > 0.75) correlates with DNA concentration. The first principal component of expression levels for the set of these 42 genes was obtained and used as a vector estimating the level of cell proliferation effects in the full set of individuals. For each methylation probe correlated either to gene expression or sequence variation, we carried out linear regression with the probe’s beta values and the cell proliferation vector. The variance of the residuals was compared with the variance of the original methylation probe, and done so cumulatively across probes to obtain the total variation in methylation of correlated probes explained by cell proliferation effects (with separate categories for CpG sites correlated to DNA sequence and gene expression). The process was repeated with expression probes found to be correlated with eQTLs and/or methylation of adjacent CpG sites to obtain an estimation of the total variation in expression of correlated genes explained by cell proliferation effects.
Whole genome bisulfite sequencing
Whole genome bisulfite sequencing was carried out for cell lines GM02316, GM02317, GM02456, and GM02555 as described  with the modification that bisulfite conversion was carried out with the EZ DNA Methylation Kit (Zymo Research) according to the manufacturer’s protocol. We performed 100 bp paired-end sequencing on the Illumina Hiseq 2000 system; sequencing details are given in Additional file 13. Reads were mapped to the bisulfite converted reference genome using BWA and processed as described by Johnson et al..
Allelic expression measurement
AE measurement was carried out as described previously . In short, approximately 200 ng gDNA and 50 to 300 ng double-stranded cDNA were genotyped in parallel on Illumina Infinium HumanOmni1-Quad, or HumanOmni2.5-Quad microarrays. The cDNA synthesis protocol was applied on heteronuclear RNA, allowing the measure of unspliced primary transcripts. For cDNA synthesis approximately 150 μg of total RNA was enriched using the MicroPoly(A)Purist protocol (Ambion Inc., Streetsville, ON, Canada). First strand cDNA synthesis was carried out on 1 μg poly(A)-enriched RNA using random hexamers, and second strand cDNA synthesis was performed using the Superscript Double-Stranded cDNA Synthesis Kit (Invitrogen). Data were filtered removing non-expressed SNPs and SNPs where cDNA arrays were unable to discriminate between homozygous genotypes, and normalized to compensate observed intensity dependent shift in median beta values of cDNA versus gDNA. For filtered SNPs obtained in the assay, smoothed scores of allelic expression were assigned based upon an eight-state left-to-right hidden Markov model (LTOR-HMM) as described in . Based upon tests in which a null distribution was simulated by permuting raw allelic expression ratios independently within each sample, and on smoothed AE scores obtained from the LTOR-HMM, a threshold of 0.2 in at least two samples was identified as identifying allelically expressed SNPs with an FDR of 5%. Consecutive aeSNPs (ae-SNPs) having a smoothed AE value of the same sign and above this threshold in at least two individuals were aggregated into regions of allelic expression (aeRegions) and the mean smoothed AE score was obtained and assigned independently for each individual, in each aeRegion.
Imputation of HapMap genotypes and phasing of Infinium HumanOmni1 and HumanOmni2.5-derived genotyping data were done using Beagle . The SNPs used in correlation analysis throughout this study to obtain eQTLs, aeQTLs, mQTLs and emQTLs are all based upon this same set of SNPs.
Gene expression arrays and eQTL analysis
Gene expression levels for 58 of the 62 individuals were determined using the Illumina HumanRef-8 Expression BeadChip according to the manufacturer’s protocol, giving expression levels for 21,916 probes mapping to a total of 16,952 genes.
These expression values were quantile normalized, the genes filtered such that only those in the top 50% variance of expression were retained, and expression values of these genes correlated using Spearman’s correlation coefficient to all SNPs within the gene boundaries or up to 250 kb upstream of the TSS or downstream of the TES. Expression values of the top 50% variable genes were permuted and the correlation analysis repeated to obtain a null distribution of P-values, and a P-value of 1.4 × 10-5 was obtained as the cutoff yielding a 5% FDR.
Identifying allelic expression aeQTLs
All HapMap (release 28) SNPs at a distance of ±250 kb flanking each aeRegion and having a minor allele frequency >10% were correlated using Spearman’s correlation coefficient to their respective aeRegion. For each aeRegion, allelic expression values were permuted amongst the samples and the regression repeated to obtain an overall null distribution used in determining the FDR of P-values. A P-value threshold of 0.0029 was set based upon an FDR of 5%.
Identifying methylation quantitative trait loci
Only probes having variance across samples in the top 25% were kept for correlation analysis with SNPs. Spearman’s rho was calculated between the highest 25% variance probes and HapMap SNPs at a distance of ±250 kb flanking each CpG probe and having minor allele frequency >10%. For each variable CpG probe, the analysis was repeated with methylation values permuted across individuals, in order to obtain a P-value of 6 × 10-6 for an FDR cutoff of 5%.
The same set of top 25% variable methylation probes and top 50% variable genes in the Illumina HumanRef-8 Expression BeadChip were used, obtaining the Spearman correlation coefficient between any methylation probe located within the body of an annotated gene or up to 250 kb on either side. Expression levels for each gene were permuted across the samples and the same set of Spearman correlation coefficients obtained, in order to set the P-value cutoff of 5.132 × 10-5 for a 5% FDR.
A CpG probe was labeled as being in the 'TSS' a gene if it was ±1,500 bp from its TSS. It was labeled as 'body' if it was not located within 1,500 bp of any TSS but was within an annotated transcript. Finally, it was labeled as 'intergenic' if it was neither 'TSS' nor 'body'.
The percentage contribution of methylation and sequence variation to expression variation was assessed using five-fold cross-validation and step-wise feature selection. For the training subset (80% of individuals), a linear model with expression of a particular gene as the response variable and genotypes of SNPs in the neighborhood of that gene as explanatory variables was selected using the stepAIC function in R ; the model was then used to predict expression values in the testing subset (1/5) of individuals. The same procedure of training and predicting was used across all five folds, and the R2 between the expression values and the predicted expression values using the models was obtained as the percentage of expression variation explained by sequence variation. The same procedure was repeated with the residuals of the gene expression values from the sequence-expression model as response variables and methylation beta values of CpG probes in the neighborhood as explanatory variables, in order to obtain the percentage of expression variation explained by methylation variation.
Gene Ontology term enrichment
Significantly overrepresented GO categories were obtained for variable CpG probes and genes correlated to DNA methylation using Fisher’s exact test via GOStat , using default parameters available on the web server.
In the case of enrichment for highly variable CpG sites, genes with at least one top 25% variable CpG site at a TSS ± 1,500 bp were used as the test set; the set of all autosomal genes overlapping with at least one Illumina 450 K CpG probe was used as the background set. In the case of methylation-expression correlation, the set of all genes whose expression correlated significantly at 5% FDR with methylation of at least one CpG site were used in the test set; the set of all genes containing at least one CpG site within 250 kb was used as the background set. P-values were calculated by the GOStat web server, whereas fold enrichment was determined by dividing the proportion of genes in the test set with a given GO term by the proportion of genes in the background set with the same GO term.
Overlap with DNase I hypersensitivity and histone markers
Data were downloaded from the ENCODE Data Consortium Center at UCSC at  on 15 October 2012, namely UW Dnase I HS and UW Histone broad peak data for fibroblast cell lines: Ag04449, Ag04450, Bj, Hff and Hcfaa (only Ag04450 and Bj were available for H3k27me3). A genomic locus was defined as having a given mark if that mark was present in at least one of the three cell lines. For each variable gene, the sets of a) positively correlated methylation probes, b) negatively correlated methylation probes, and c) all probes in a 250 kb neighborhood were obtained. For each category, the average (across genes) proportion of probes overlapping each type of mark was determined.
Overlap between mQTLs and aeQTLs
The set of all SNPs that are categorized as correlated both to gene expression (aeQTL having relationship to an aeRegion and/or an eQTL correlated to a gene in the Ref8 array) and to DNA methylation (mQTL) at an FDR threshold of 5% in both of the respective analyses described above are categorized as methylation-regulatory SNPs (emQTLs).
Methylation, gene expression, and SNP genotyping data for this publication have been deposited in NCBI’s Gene Expression Omnibus and are accessible through GEO SuperSeries accession number GSE53261.
allelic expression quantitative trait locus
allelic expression single nucleotide polymorphism
DNase I hypersensitivity
expression and methylation quantitative trait loci
expression quantitative trait locus
false discovery rate
histone 3 lysine 27 tri-methylation
histone 3 lysine 4 tri-methylation
lymphoblastoid cell line
left-to-right hidden Markov model
methylation quantitative trait locus
transcription end site
transcription start site.
We thank Mathieu Lavallée-Adam for useful discussions. This work was supported by Canadian Institute of Health Research grants MOP-111246 awarded to TP and EP1-120608 awarded to TP and MB, and by a Natural Science and Engineering Research Council of Canada (NSERC) Discovery grant to MB. TP is a recipient of a Canada Research Chair Tier 2 award. SB is a recipient of a postdoctoral fellowship from the Réseau de Medecine Genetique Appliquée (RMGA). JRW is an NSERC Postgraduate Doctoral Scholar and a Walter C Sumner Memorial Fellow.
- Payer B, Lee JT: X chromosome dosage compensation: how mammals keep the balance. Annu Rev Genet. 2008, 42: 733-772. 10.1146/annurev.genet.42.110807.091711.PubMedView ArticleGoogle Scholar
- Yoder JA, Walsh CP, Bestor TH: Cytosine methylation and the ecology of intragenomic parasites. Trends Genet. 1997, 13: 335-340. 10.1016/S0168-9525(97)01181-5.PubMedView ArticleGoogle Scholar
- Li E, Beard C, Jaenisch R: Role for dna methylation in genomic imprinting. Nature. 1993, 366: 362-365. 10.1038/366362a0.PubMedView ArticleGoogle Scholar
- Baylin SB, Herman JG, Graff JR, Vertino PM, Issa JP: Alterations in DNA methylation: a fundamental aspect of neoplasia. Adv Cancer Res. 1998, 72: 141-196.PubMedView ArticleGoogle Scholar
- Kass SU, Landsberger N, Wolffe AP: DNA methylation directs a time-dependent repression of transcription initiation. Curr Biol. 1997, 7: 157-165. 10.1016/S0960-9822(97)70086-1.PubMedView ArticleGoogle Scholar
- Jones PA: The DNA methylation paradox. Trends Genet. 1999, 15: 34-37. 10.1016/S0168-9525(98)01636-9.PubMedView ArticleGoogle Scholar
- Jones PA: Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet. 2012, 13: 484-492. 10.1038/nrg3230.PubMedView ArticleGoogle Scholar
- Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B, Garg K, John S, Sandstrom R, Bates D, Boatman L, Canfield TK, Diegel M, Dunn D, Ebersol AK, Frum T, Giste E, Johnson AK, Johnson EM, Kutyavin T, Lajoie B, Lee BK, Lee K, London D, Lotakis D, Neph S, et al: The accessible chromatin landscape of the human genome. Nature. 2012, 489: 75-82. 10.1038/nature11232.PubMedPubMed CentralView ArticleGoogle Scholar
- Cedar H, Bergman Y: Linking DNA methylation and histone modification: patterns and paradigms. Nat Rev Genet. 2009, 10: 295-304. 10.1038/nrg2540.PubMedView ArticleGoogle Scholar
- Hashimshony T, Zhang JM, Keshet I, Bustin M, Cedar H: The role of DNA methylation in setting up chromatin structure during development. Nat Genet. 2003, 34: 187-192. 10.1038/ng1158.PubMedView ArticleGoogle Scholar
- Bell JT, Pai AA, Pickrell JK, Gaffney DJ, Pique-Regi R, Degner JF, Gilad Y, Pritchard JK: DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biol. 2011, 12: R10-10.1186/gb-2011-12-1-r10.PubMedPubMed CentralView ArticleGoogle Scholar
- Gibbs JR, van der Brug MP, Hernandez DG, Traynor BJ, Nalls MA, Lai S-L, Arepalli S, Dillman A, Rafferty IP, Troncoso J, Johnson R, Zielke HR, Ferrucci L, Longo DL, Cookson MR, Singleton AB: Abundant quantitative trait loci exist for DNA methylation and gene expression in human brain. PloS Genet. 2010, 6: e1000952-10.1371/journal.pgen.1000952.PubMedPubMed CentralView ArticleGoogle Scholar
- Fraser HB, Lam LL, Neumann SM, Kobor MS: Population-specificity of human DNA methylation. Genome Biol. 2012, 13: R8-10.1186/gb-2012-13-2-r8.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang DD, Cheng LJ, Badner JA, Chen C, Chen Q, Luo W, Craig DW, Redman M, Gershon ES, Liu CY: Genetic control of individual differences in gene-specific methylation in human brain. Am J Hum Genet. 2010, 86: 411-419. 10.1016/j.ajhg.2010.02.005.PubMedPubMed CentralView ArticleGoogle Scholar
- van Eijk KR, de Jong S, Boks MPM, Langeveld T, Colas F, Veldink JH, de Kovel CGF, Janson E, Strengman E, Langfelder P, Kahn RS, van den Berg LH, Horvath S, Ophoff RA: Genetic analysis of DNA methylation and gene expression levels in whole blood of healthy human subjects. BMC Genomics. 2012, 13: 636-10.1186/1471-2164-13-636.PubMedPubMed CentralView ArticleGoogle Scholar
- Drong AW, Nicholson G, Hedman AK, Meduri E, Grundberg E, Small KS, Shin S-Y, Bell JT, Karpe F, Soranzo N, Spector TD, McCarthy MI, Deloukas P, Rantalainen M, Lindgren CM, Mol PC: The presence of methylation quantitative trait loci indicates a direct genetic influence on the level of DNA methylation in adipose tissue. PloS One. 2013, 8: e55923-10.1371/journal.pone.0055923.PubMedPubMed CentralView ArticleGoogle Scholar
- Gutierrez-Arcelus M, Lappalainen T, Montgomery SB, Buil A, Ongen H, Yurovsky A, Bryois J, Giger T, Romano L, Planchon A, Falconnet E, Bielser D, Gagnebin M, Padioleau I, Borel C, Letourneau A, Makrythanasis P, Guipponi M, Gehrig C, Antonarakis SE, Dermitzakis ET: Passive and active DNA methylation and the interplay with genetic variation in gene regulation. Elife. 2013, 2: e00523-10.7554/eLife.00523.PubMedPubMed CentralGoogle Scholar
- Sandoval J, Heyn HA, Moran S, Serra-Musach J, Pujana MA, Bibikova M, Esteller M: Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics. 2011, 6: 692-702. 10.4161/epi.6.6.16196.PubMedView ArticleGoogle Scholar
- Leek JT, Storey JD: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007, 3: 1724-1735.PubMedView ArticleGoogle Scholar
- Myers RM, Stamatoyannopoulos J, Snyder M, Dunham I, Hardison RC, Bernstein BE, Gingeras TR, Kent WJ, Birney E, Wold B, Crawford GE, Bernstein BE, Epstein CB, Shoresh N, Ernst J, Mikkelsen TS, Kheradpour P, Zhang XL, Wang L, Issner R, Coyne MJ, Durham T, Ku MC, Truong T, Ward LD, Altshuler RC, Lin MF, Kellis M, Gingeras TR, Davis CA, et al: A user's guide to the Encyclopedia of DNA Elements (ENCODE). PloS Biol. 2011, 9: e1001046-10.1371/journal.pbio.1001046.View ArticleGoogle Scholar
- Rosenfeld JA, Wang ZB, Schones DE, Zhao K, DeSalle R, Zhang MQ: Determination of enriched histone modifications in non-genic portions of the human genome. BMC Genomics. 2009, 10: 143-10.1186/1471-2164-10-143.PubMedPubMed CentralView ArticleGoogle Scholar
- Ge B, Pokholok DK, Kwan T, Grundberg E, Morcos L, Verlaan DJ, Le J, Koka V, Lam KCL, Gagne V, Dias J, Hoberman R, Montpetit A, Joly MM, Harvey EJ, Sinnett D, Beaulieu P, Hamon R, Graziani A, Dewar K, Harmsen E, Majewski J, Goring HHH, Naumova AK, Blanchette M, Gunderson KL, Pastinen T: Global patterns of cis variation in human cells revealed by high-density allelic expression analysis. Nat Genet. 2009, 41: 1216-U1278. 10.1038/ng.473.PubMedView ArticleGoogle Scholar
- Wagner JR, Ge B, Pokholok D, Gunderson KL, Pastinen T, Blanchette M: Computational analysis of whole-genome differential allelic expression data in human. PLoS Comput Biol. 2010, 6: e1000849-10.1371/journal.pcbi.1000849.PubMedPubMed CentralView ArticleGoogle Scholar
- Veyrieras JB, Kudaravalli S, Kim SY, Dermitzakis ET, Gilad Y, Stephens M, Pritchard JK: High-resolution mapping of expression-QTLs yields insight into human gene regulation. PloS Genet. 2008, 4: e1000214-10.1371/journal.pgen.1000214.PubMedPubMed CentralView ArticleGoogle Scholar
- Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, Ingle CE, Dunning M, Flicek P, Koller D, Montgomery S, Tavare S, Deloukas P, Dermitzakis ET: Population genomics of human gene expression. Nat Genet. 2007, 39: 1217-1224. 10.1038/ng2142.PubMedPubMed CentralView ArticleGoogle Scholar
- Almlof JC, Lundmark P, Lundmark A, Ge B, Maouche S, Goering HHH, Liljedahl U, Enstrom C, Brocheton J, Proust C, Godefroy T, Sambrook JG, Jolley J, Crisp-Hihn A, Foad N, Lloyd-Jones H, Stephens J, Gwilliam R, Rice CM, Hengstenberg C, Samani NJ, Erdmann J, Schunkert H, Pastinen T, Deloukas P, Goodall AH, Ouwehand WH, Cambien F, Syvanen A-C, Consortium C, et al: Powerful identification of cis-regulatory SNPs in human primary monocytes using allele-specific gene expression. PloS One. 2012, 7: e52260-10.1371/journal.pone.0052260.PubMedPubMed CentralView ArticleGoogle Scholar
- Veyrieras JB, Gaffney DJ, Pickrell JK, Gilad Y, Stephens M, Pritchard JK: Exon-specific QTLs skew the inferred distribution of expression QTLs detected using gene expression array data. PloS One. 2012, 7: e30629-10.1371/journal.pone.0030629.PubMedPubMed CentralView ArticleGoogle Scholar
- Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B: Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012, 485: 376-380. 10.1038/nature11082.PubMedPubMed CentralView ArticleGoogle Scholar
- Xiang L, Kong B: PAX8 is a novel marker for differentiating between various types of tumor, particularly ovarian epithelial carcinomas (Review). Oncol Lett. 2013, 5: 735-738.PubMedPubMed CentralGoogle Scholar
- Faghihi MA, Wahlestedt C: Regulatory roles of natural antisense transcripts. Nat Rev Mol Cell Biol. 2009, 10: 637-643. 10.1038/nrm2738.PubMedPubMed CentralView ArticleGoogle Scholar
- Li Q, Seo J-H, Stranger B, McKenna A, Pe’er I, LaFramboise T, Brown M, Tyekucheva S, Freedman ML: Integrative eQTL-based analyses reveal the biology of breast cancer risk loci. Cell. 2013, 152: 633-641. 10.1016/j.cell.2012.12.034.PubMedPubMed CentralView ArticleGoogle Scholar
- Moarefi AH, Chedin F: ICF syndrome mutations cause a broad spectrum of biochemical defects in DNMT3B-mediated de novo DNA methylation. J Mol Biol. 2011, 409: 758-772. 10.1016/j.jmb.2011.04.050.PubMedView ArticleGoogle Scholar
- Wiench M, John S, Baek S, Johnson TA, Sung M-H, Escobar T, Simmons CA, Pearce KH, Biddie SC, Sabo PJ, Thurman RE, Stamatoyannopoulos JA, Hager GL: DNA methylation status predicts cell type-specific enhancer activity. EMBO J. 2011, 30: 3028-3039. 10.1038/emboj.2011.210.PubMedPubMed CentralView ArticleGoogle Scholar
- Chang HY, Chi JT, Dudoit S, Bondre C, van de Rijn M, Botstein D, Brown PO: Diversity, topographic differentiation, and positional memory in human fibroblasts. Proc Natl Acad Sci U S A. 2002, 99: 12877-12882. 10.1073/pnas.162488599.PubMedPubMed CentralView ArticleGoogle Scholar
- Koch CM, Suschek CV, Lin Q, Bork S, Goergens M, Joussen S, Pallua N, Ho AD, Zenke M, Wagner W: Specific age-associated DNA methylation changes in human dermal fibroblasts. PLos One. 2011, 6: e16679-10.1371/journal.pone.0016679.PubMedPubMed CentralView ArticleGoogle Scholar
- HumanMethylation450 BeadChip achieves breadth of coverage using two infinium chemistries. [http://www.illumina.com/documents/products/technotes/technote_hm450_data_analysis_optimization.pdf]
- Bioconductor Surrogate Variable Analysis Package Description. [http://www.bioconductor.org/packages/release/bioc/html/sva.html]
- Kulis M, Heath S, Bibikova M, Queiros AC, Navarro A, Clot G, Martinez-Trillos A, Castellano G, Brun-Heath I, Pinyol M, Barberan-Soler S, Papasaikas P, Jares P, Bea S, Rico D, Ecker S, Rubio M, Royo R, Ho V, Klotzle B, Hernandez L, Conde L, Lopez-Guerra M, Colomer D, Villamor N, Aymerich M, Rozman M, Bayes M, Gut M, Gelpi JL, et al: Epigenomic analysis detects widespread gene-body DNA hypomethylation in chronic lymphocytic leukemia. Nat Genet. 2012, 44: 1236-1242. 10.1038/ng.2443.PubMedView ArticleGoogle Scholar
- Johnson MD, Mueller M, Game L, Aitman TJ: Single nucleotide analysis of cytosine methylation by whole-genome shotgun bisulfite sequencing. Curr Protocols Mol Biol. 2012, Chapter 21: Unit21.23-Google Scholar
- Browning BL, Browning SR: A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009, 84: 210-223. 10.1016/j.ajhg.2009.01.005.PubMedPubMed CentralView ArticleGoogle Scholar
- R stepAIC function description. [http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/stepAIC.html]
- Beissbarth T, Speed TP: GOstat: find statistically overrepresented gene ontologies within a group of genes. Bioinformatics. 2004, 20: 1464-1465. 10.1093/bioinformatics/bth088.PubMedView ArticleGoogle Scholar
- ENCODE. [http://genome.ucsc.edu/ENCODE/downloads.html]
- Kernel density estimation. [http://stat.ethz.ch/R-manual/R-patched/library/stats/html/density.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.