Study sample
Samples were drawn from the Avon Longitudinal Study of Parents and Children [12, 13]. Blood from 1018 mother–child pairs (children at three time points and their mothers at two time points) were selected for analysis as part of the Accessible Resource for Integrative Epigenomic Studies (ARIES, http://www.ariesepigenomics.org.uk/) (Additional file 1: Table S1) [14]. Numbers of samples surviving the imposed quality control (QC) thresholds at each time point are shown in Additional file 1: Table S2. Sample selection was on the basis of sample availability at all of the chosen time points in mother–child pairs. Cord blood and peripheral blood samples (whole blood, buffy coats, white blood cells or blood spots) were collected according to standard procedures.
Methylation assays
Following DNA extraction, samples were bisulphite converted using the Zymo EZ DNA Methylation™ kit (Zymo, Irvine, CA, USA). Following conversion, genome-wide methylation was measured using the Illumina Infinium HumanMethylation450 (HM450) BeadChip. The arrays were scanned using an Illumina iScan, with initial quality review using GenomeStudio. During the data generation process a wide range of batch variables were recorded in a purpose-built laboratory information management system (LIMS). The LIMS also reported QC metrics from the standard control probes on the HM450 BeadChip for each sample. Samples failing QC were excluded from further analysis and the assay repeated. Data points with a low signal:noise ratio (detection p > 0.01) or with methylated or unmethylated read counts of 0 were also excluded from analysis. As an additional QC step genotype probes on the Ilumina Infinium HM450 array were compared between samples from the same individual at different time points and against SNP-chip data (HM450 probes clustered using k-means) to identify and remove any sample mismatches. Methylation data were normalised in R with the wateRmelon package [36] using the Touleimat and Tost [37] algorithm to reduce the non-biological differences between probes. Data were then rank-normalised to remove outliers and regressed on all covariates plus bisulphite-converted DNA (BCD) plate batch to remove potential batch effects (with missing values set to probe mean).
Genotyping assays
The ARIES participants were previously genotyped as part of the larger ALSPAC study, with QC, cleaning and imputation performed at the cohort level before extraction of the subset comprising ARIES. ARIES participants are of European white ancestry and homogeneous compared with HapMap reference populations (Additional file 1: Figure S18).
Children were genotyped using the Illumina HumanHap550 quad genome-wide SNP genotyping platform (Illumina Inc., San Diego, CA, USA) by the Wellcome Trust Sanger Institute (WTSI; Cambridge, UK) and the Laboratory Corporation of America (LCA, Burlington, NC, USA). Individuals were excluded on the basis of incorrect gender assignment, abnormal heterozygosity (<0.320 or >0.345 for WTSI data; <0.310 or >0.330 for LCA data), high missingness (>3 %), cryptic relatedness (>10 % identity by descent) and non-European ancestry (detected by multidimensional scaling analysis). Following QC, the final directly genotyped dataset contained 500,527 SNP loci.
Mothers were genotyped using the Illumina human660W-quad genome-wide SNP genotyping platform (Illumina Inc., San Diego, CA, USA) at the Centre National de Génotypage (CNG; Paris, France). Individuals were excluded based on non-European ancestry, missingness, relatedness, gender mismatches and heterozygosity. PLINK (v1.07) [38] was used to carry out quality control measures on an initial set of 10,015 subjects (including non-ARIES ALSPAC participants) and 557,124 directly genotyped SNPs. Following QC, the final directly genotyped dataset contained 526,688 SNP loci.
Imputation was performed to increase the SNP density for all genotyped mothers and children combined. Genotypes were phased together using ShapeIt (version 2, revision 727) and then imputed against the 1000 Genomes reference panel (phase 1, version 3, phased using ShapeIt version 2, December 2013, using all populations) using Impute (v2.2.2). Genotypes were filtered to have Hardy–Weinberg equilibrium p > 5 × 10−7, MAF >1 % and imputation info score >0.8. Best guess genotypes were used for subsequent analysis. The final imputed dataset used for the analyses presented here contained 8,074,398 loci.
Genotype/methylation association tests
Summary details of samples, SNPs, CpGs and covariates included in association tests are presented in Additional file 1: Table S2. Each SNP in the imputed datasets was analysed against all CpG sites in the Illumina Infinium HM450 array with the exception of those failing QC and those reported to map to more than one location (N = 19,834) or to contain a genetic variant at the CpG site (N = 74,182) [34]. We performed a post hoc test for analysis of non-specific probes which determined that there was no enrichment of cross-hybridising probes in our results. We opted for post hoc annotation of potential probe effects within our results database for the remaining potentially problematic CpG sites (N = 229,983) [34]. The final number of probes analysed was 395,625. Preliminary association analysis of SNPs with CpG sites was performed using an additive model (rank-normalised CpG methylation on SNP allele count) using Matrix eQTL [39] in order to perform the computationally demanding task of estimating 16 trillion associations. Analyses were batched by SNP chromosome and sample time point for parallel analysis on the University of Bristol High Performance Computing (HPC) cluster. SNP effects from this analysis that were p < 1 × 10−7 were then taken forward for re-analysis in PLINK1.07 to perform exact linear regression including covariates. Covariates included in all analyses were age (excluding birth), sex (children only), the top ten ancestry principal components, bisulphite conversion batch and estimated white blood cell counts (using an algorithm based on differential methylation between cell types [33]). Methylation at each CpG site was regressed on these covariates and residuals taken forward for regression on SNP genotype. To report the number of mQTL, we used a conservative threshold of 1 × 10−14. All associations below 1 × 10−7 were stored and are available in our online mQTL database (http://www.mqtldb.org/). Analysis on rank-normalised data results in effect sizes that are not directly interpretable on the original scale. Additional file 1: Figure S2 illustrates the effect size distributions observed in our data.
Without access to an appropriate replication sample, we performed all subsequent enrichment and downstream analysis using only cis mQTL with p < 1 × 10−14 (Additional file 1: Table S3) to reduce the possibility of including false positives (unless otherwise stated). We consider results below p < 1 × 10−14 in a single time point to provide informative evidence of association on the basis that this corresponds to a 0.2 % false positive rate after a Bonferonni correction for the number of tests based on directly genotyped SNPs and directly assayed CpG sites. We define replication in additional time points to be associations at p < 1 × 10−7 because typically we are testing in the order of 30,000 associations for replication and on the basis that these are supported by their combination with the evidence at other time points through the life course.
Code availability
Code to run comparable mQTL analyses with ARIES data is available under the GNU GPL (v3) license at https://github.com/MRCIEU/ariesmqtl.
Conditional mQTL analysis
Linkage disequilibrium is expected to cause many mQTL to be represented by multiple tag SNPs. We used the conditional analysis implementation in GCTA [19, 40] to determine the most representative independent loci associated with each CpG site. We used the summary statistics from the results of our PLINK analysis (i.e. any SNP–CpG pair with a p ≤ 1 × 10−14 at any time point) and entered them into a stepwise model using individual level genotype data to determine whether they independently account for association with the CpG site. Independent signals with p < 1 × 10−14 are considered significant.
Patterns of methylation across genomic features and time
The distribution of DNA methylation across CpG sites in different genic and CpG island regions was analysed using annotations from the Illumina Infinium HM450 manifest file (v1.2). Each probe was assigned to an annotation category (e.g. 5′ UTR, 3′ UTR, gene body, CpG island, etc.), the median methylation beta (proportion methylation) for that probe was calculated and then the distribution of these medians across mQTL-associated CpG sites in a category was plotted in a histogram.
To characterise the temporal pattern of methylation, the correlation of methylation between each pair of time points was calculated for each probe. The distribution of correlation coefficients (r2) was plotted in a histogram for each pair of time points. A null distribution of correlation for each pair was derived by randomising the sample order for one of the pair of time points, and similarly plotted as a histogram of correlation coefficients.
Patterns of mQTL across genomic features and time
The distribution of mQTL-associated CpG sites was analysed using annotations from the Illumina Infinium HM450 manifest file (v1.2). The proportion of CpG probes falling into each annotation category was plotted in a bar chart to evaluate the distribution across different genomic features. A similar approach was used for mQTL SNPs, which were annotated with features using the Variant Effect Predictor [41] (all associated SNPs were included).
SNP heritability estimation of methylation probes
In order to estimate the proportion of the variation in CpG probes that is due to genetic variation, we used SNP heritability analysis for each probe (after filtering) at each time point. This was performed using genomic restricted maximum likelihood (GREML) as implemented in GCTA (v1.24). The same covariates were fitted here as they were for the mQTL mapping analysis. Only unrelated individuals were used in these analyses (genetic relatedness <0.05) in order to obtain a population-based ‘SNP’ heritability estimate, which is typically an underestimate of true heritability in complex traits because the magnitude of the estimate is limited to the proportion of all genomic variation that is captured by the SNPs on the genotyping array. In this case we used HapMap3 [20] SNPs with MAF >0.01 and imputation quality score >0.8 (1,171,463 SNPs after filtering). The SNP heritability was estimated for each probe using two variance components, the first generated using only SNPs within ±1 Mb of the CpG site (the cis component) and the second generated using all remaining SNPs (the trans component), such that var(y) = var(g_cis) + var(g_trans) + var(e), where the genetic variance due to all SNPs var(g) = var(gi_cis) + var(g_trans), and var(g)/var(p) is the total SNP heritability for a probe at a particular time point. The same analysis was repeated for each probe at each time point.
We then make a distinction between ‘explained’ and ‘unexplained’ genetic variation (Fig. 2b). Some proportion of the total estimated SNP heritability has been ‘explained’ by the detected mQTLs, estimated by summing the R2 values for each conditionally significant association at p < 1 × 10−14 for a particular probe at a particular time point. Thus, we term the remaining genetic variation ‘unexplained’. The ‘explained’ genetic variation was estimated for both the cis and trans genetic components.
Proportion of trait variance explained by mQTL
In order to ascertain if mQTL contribute to variance in disease liability, a mixed model analysis was performed to calculate the proportion of the trait variance that could be explained by mQTL and non-mQTL in publicly available data from the first WTCCC publication [24]. The variance partitioning model:
$$ {y}_i=\mu +{g}_{i,1}+{g}_{i,2}+{e}_i $$
was fitted using REML as implemented in GCTA (v1.24) software [19]. Here, y
i
is the phenotype of individual i, g
i,1 is their genetic value due to mQTL, g
i,2 is their genetic value due to all HapMap3 [20] SNPs that were not identified as mQTL and e
i
is their residual value. To avoid results being influenced by MHC, both g
i,1 and g
i,2 were estimated excluding all SNPs on chromosome 6. This analysis was performed for seven traits (bipolar disorder, Crohn’s disease, type 1 diabetes, type 2 diabetes, coronary heart disease, hypertension, rheumatoid arthritis) using the WTCCC data [24]. The WTCCC data underwent the same quality control procedure described in [42]. Specifically, the data were split into seven case-control datasets where the controls were common across all traits; each dataset was filtered for MAF >0.01, genotype missingness <0.002, Hardy–Weinberg equilibrium p > 0.01 and any SNPs with differential missingness between cases and controls (p < 0.05) were removed. This left approximately 230,000 SNPs for each dataset. Pairwise genomic similarity was then estimated using scaled SNP similarity and any individual pairs with genomic relatedness >0.025 were removed. Principal components were estimated from the SNP data after removing regions of long-range LD and pruning to low LD and five rounds of outlier removal were used whereby if any individual had a value of mean ± 5 standard deviations in any of the first 20 components they were removed. Imputation was performed to 1000 Genomes reference panel (phase 1, version 3, release December 2013) in two stages: first, haplotypes of the WTCCC samples were estimated jointly using ShapeIt2 and then imputation was performed using Impute2. Then, g
i,2 was estimated using only the imputed SNPs that were present in the HapMap3 reference. To test if the mQTLs detected in ARIES made disproportionately large contributions to SNP heritability in the WTCCC datasets, we used only cis mQTL with p < 1 × 10−14 at any time point that were then LD pruned, leaving 29,805 SNPs in total. Generating an appropriate null distribution against which to compare the point estimate of the mQTL variance component in this analysis is important because the mQTL variance component may have been contaminated by variance due to genic SNPs rather than directly due to mQTL alone. To test for this possibility, we generated two different ‘null’ experiments. First, we sampled 29,805 SNPs that were within ±500 kb of a gene (‘genic SNPs’), matching the SNPs to have the same LD and allele frequency distributions as the mQTLs and performing the entire analysis again, but instead of using mQTLs, g
i,1 was constructed using these sampled SNPs. This was performed 100 times to get a distribution of estimates under the null hypothesis that proximity to genic regions is sufficient to explain the influence of mQTLs on the trait. Second, the same was done except instead of sampling from genic regions specifically, the null SNP set was generated to have the same number of SNPs in 5′ UTR, 3′ UTR, intronic, intergenic, upstream, downstream and unannotated regions of the genome (‘mQTL annotation-matched SNPs’). All SNP heritability estimates were transformed from the observed to the liability scale using the trait prevalence estimates used in [43]. Sex and the first 20 genetic principal components were included as covariates in all analyses.
Mediation of trans mQTL effects by cis methylation
Potential mediation of trans mQTL effects by cis methylation has been previously reported [21]. This type of analysis is particularly susceptible to measurement error in the cis (mediating) CpG methylation variable so requires cautious interpretation. For all trans associations we took the estimate of association from a linear regression of B ~ G, and then performed a second linear regression B ~ G + Kx, where B is methylation at the trans CpG site, G is the genotype (or allele score if more than one independent cis association) and K is methylation at the cis CpG site. Reduction in regression coefficient for genotype when cis CpG methylation was added to the model was evaluated as an estimate of potential mediation (given the limitations of measurement error). We estimated the likely impact of measurement error by performing a series of analyses in which we added increasing levels of simulated measurement error (multiplication by random values sampled from a normal distribution with mean 1 and a given standard deviation).
Enrichment analysis for mQTL in GWAS results
To test if the mQTL were enriched for low p values in previously published GWAS results, the following procedure was performed. Only independent mQTL with cis effects of p < 1 × 10−14 obtained from the conditional analysis and then filtered to remove SNPs with LD r2 > 0.1 were used to test for enrichment. We collected data from a set of 33 complex traits from online sources. All available mQTL were extracted from each trait and Fisher's method was used to estimate a combined p value for each trait:
$$ {\chi}_{2k}^2 \sim - 2{\displaystyle \sum_{i=1}^k \ln\ {p}_i} $$
In order to compare these p values to a realistic null distribution, the same procedure was performed but using 10,000 random draws of the same number of SNPs from (a) genic regions or (b) mQTL annotation-matched SNPs. In both cases SNPs were also matched to the mQTLs on allele frequency and number of proxies to adjust for LD structure. In order to match on LD, each mQTL was given an LD score by calculating the number of SNPs that were in LD r2 > 0.8 and the same was done for each of the candidate genic SNPs (approximately 200,000 genic SNPs). A null distribution for each GWAS trait was generated by performing this procedure 1000 times, each time making a new random draw of genic SNPs matched for allele frequency and LD. Empirical p values were generated by simply finding the rank of the meQTL p value among the 1000 null p values.
Gene ontology enrichment
mQTL SNPs were mapped to the nearest gene (by genomic coordinate) using the Variant Effect Predictor and lists of unique gene names were analysed for enrichment of gene ontology (GO) terms relative to a human reference using the gene ontology function (‘go’) in the Orange bioinformatics add-on (‘Orange.bio’ , http://orange.biolab.si/download/). CpG sites associated with mQTL were similarly (but separately) tested for GO term enrichment using the Illumina gene-name annotations for each CpG site. Analyses were repeated for time point-specific loci and those common to all time points. Potentially enriched GO terms (biological function) were plotted using REVIGO [44], where strength of evidence for enrichment (−log10(p value)) is proportional to size.
Comparison of blood mQTL with blood eQTL
Previously reported blood eQTL [22] were downloaded from http://www.gtexportal.org/ and compared with mQTL discovered at each time point to determine overlap. Cis QTL were considered to be shared if the same SNP was reported to be associated at p < 1 × 10−14 in mQTL data and p < 2.5 × 10−7 in eQTL data (the reported GTEx threshold). For GO analyses we compared our mQTL data with blood eQTL from Westra et al. [23] as these data enabled us to evaluate enrichment of common QTLs for both cis and trans; however, a lack of data prevented us from comparing the numeric overlap between the Westra et al. eQTL and our mQTL.
Ethics approval and consent to participate
Written informed consent has been obtained from all ALSPAC participants. Ethical approval for the study was obtained from the ALSPAC Ethics and Law Committee (IRB00003312) and Local Research Ethics Committees in accordance with the guidelines of The Declaration of Helsinki.
Availability of data and materials
A mQTL database containing all results from this study is available online at http://www.mqtldb.org, which includes both search functionality and full download of all results at p < 1 × 10−7 using Matrix eQTL and GCTA. Data are also available for download at doi:10.5523/bris.r9bxayo5mmk510dczq6golkmb.
All data used in this study were provided to the authors by the ALSPAC cohort study via their standard Access Policy. Data are available to other bona fide researchers on the same basis; for details please see http://www.bristol.ac.uk/alspac/researchers/data-access/.