Human subjects
Informed written consent was obtained from all subjects prior to participation, and experimental methods complied with the Helsinki declaration. Scientific approval for the Caucasian studies was obtained under IRB protocol H-18849 at the Baylor College of Medicine. For the Gambian studies, the Scientific Coordinating Committee of MRC Unit, The Gambia, granted scientific approval and the joint Gambian Government/MRC Ethics Committee (SCC/EC 1151) and the London School of Hygiene and Tropical Medicine Ethics Committee (EC 5525) granted ethical permission for this study. Sample collection, study populations, and DNA isolation have been previously described [10,20,21].
Bisulfite-seq, SNP calling from Bisulfite-seq data, and filtering bins based on SNP score
Bisulfite-seq library preparation and sequencing were performed in the Baylor College of Medicine Human Genome Sequencing Center, and read mapping and data processing were performed as previously described [21]. The accuracy of our methylation calls by Bisulfite-seq has been quantitatively validated [21]. Combined bisulfite-seq data from HF and PBL samples from two individuals (C01 and C02) [21] were used to determine SNP scores for 200 bp bins genomewide. Reads were mapped to hg19 with BISMARK [38] and BisSNP (v.0.82.2) [39] was run to call SNPs on each combined sample (PBL + HF) using the default settings and dbSNP 135. The VCFpostprocess tool of BisSNP was used with the default settings to filter the raw SNP calls. SNP scores were assigned at each locus for which there was a SNP called in either C01 or C02. A score of 1 was assigned to those loci at which both C01 and C02 had the same SNP call. A score of 0 was assigned otherwise. Average SNP scores were calculated for all 200-bp bins genomewide. Genomewide bins were defined as those containing at least 2 CpG sites and which had mapping coverage [21] in at least one of the four C01/C02 HF/PBL samples (N = 5,257,320). A bin’s SNP score was set to 1 if no SNP was called in either individual.
Formulation of the systemic interindividual variation index
$$ x = \mathrm{P}\mathrm{B}\mathrm{L}\ \mathrm{residual}\ \left(\%\mathrm{meth}\ \mathrm{in}\ \mathrm{C}01\ \hbox{-}\ \%\mathrm{meth}\ \mathrm{in}\ \mathrm{C}02\right) $$
$$ y = \mathrm{H}\mathrm{F}\ \mathrm{residual}\ \left(\%\mathrm{meth}\ \mathrm{in}\ \mathrm{C}01\ \hbox{-}\ \%\mathrm{meth}\ \mathrm{in}\ \mathrm{C}02\right) $$
$$ \mathrm{SIVI}=\mathrm{A}+\mathrm{B}+\mathrm{C} $$
Where
$$ \mathrm{A}=\sqrt{\left(\left|x\cdot y\right|\right)}\left(\mathrm{rewards}\ \mathrm{maximal}\ \mathrm{interindividual}\ \mathrm{differences}\right) $$
$$ \mathrm{B}=-sd\left(x,y\right)\left(\mathrm{rewards}\ \mathrm{in}\mathrm{terindividual}\ \mathrm{differences}\ \mathrm{that}\ \mathrm{are}\ \mathrm{similar}\ \mathrm{in}\ \mathrm{both}\ \mathrm{tissues}\right) $$
$$ \mathrm{C}=- \max \left(sd{\left(\% met{h}_{PBL},\% met{h}_{HF}\right)}_{C01},sd{\left(\% met{h}_{PBL},\% met{h}_{HF}\right)}_{C02}\right)\left(\mathrm{rewards}\ \mathrm{consistent}\ \mathrm{percent}\ \mathrm{methylation}\ \mathrm{across}\ \mathrm{both}\ \mathrm{tissues}\right) $$
Targeted analysis of BluePrint epigenome data
Bisulfite-seq methylation calls from each of six monocyte samples (C000S5, C0010K, C001UY, C004SQ, C005PS, S000RD) from the Blueprint Epigenome project [22] were placed into 200 bp bins and an average methylation score was determined for each bin. Each bin which had a methylation score in at least two samples was then assigned a range score, defined as the difference between the highest and lowest methylation score of all informative samples.
Comparison of genomic features among metastable epiallele and non-metastable epiallele genomewide bins
The distances from each non-ME and ME bin genomewide were determined for all SINE, LINE, and ERV elements (UCSC RepeatMasker track, hg19) and CGIs (UCSC CpG islands track, hg19). Non-ME bins (N = 298,979) were defined as those having a SIVI score between -5 and 5, a SNP score of 1, no overlap with segmental duplication (UCSC, hg19), and at least six CpG sites. ME bins (N = 109) had a minimum SIVI of 20 but otherwise the same characteristics. A 20 kb window centered on the midpoint of each bin was divided into 40 intervals of 500 bp each. The overlap score is the number of each type of individual elements that overlap the interval. The 'normalized overlap score' was calculated by dividing the raw overlap score by the total number of bins in each set. Chi-squared tests were performed on the raw overlap scores for CGIs and SINE, LINE, and ERV elements. A chi-squared test was also performed to determine the significance of the localization of ME bins to subtelomeric regions (defined as the 1 Mb flanking each 10 kb telomere).
Bisulfite pyrosequencing
Quantitative analysis of CpG site-specific DNA methylation at VTRNA2-1 was performed by bisulfite pyrosequencing [40]. The pyrosequencing assay was validated using standards composed of known mixtures of methylated and unmethylated human genomic DNA [41] (Figure S7 in Additional file 1). The primers used were as follows: forward TGAAGGTGTGATAGAAAGTATG, reverse (Biotin) ACATTTTTTTATCCCCATA, sequencing AGTATGGAGGTTGGTTATT.
450k array hybridization, data processing, and analysis of the Gambian sample cohort
Sample preparation and hybridization to the Illumina HumanMethylation450 BeadChip arrays were performed at the Genetics Services Platform of the International Agency for Research on Cancer, according to manufacturer’s instructions. Data processing and analysis were performed as follows: 1) pre-filtering and quality control (QC); 2) color adjustment, probe-type bias correction and inter-sample quantile normalization; 3) batch effect correction and evaluation of alternative QC pipelines; 4) data-driven estimation of white blood cell counts; 5) replicate removal and outlier detection; 6) removal of probes close to SNPs; 7) bump hunting analysis.
Pre-filtering and quality control
Raw data for 485,577 CpG probes on the 450k array were loaded from IDAT files (n = 124 samples including three technical replicates). Array-wide two-dimensional multi-dimensional scaling plots clustered into two groups representing infant sex, confirming that recorded sex was correct for all samples. We then removed 11,656 probes on X and Y chromosomes, together with 57 SNP probes provided to detect potential sample mix-ups; 473,864 probes remained.
A further probe filtering step was implemented based on probe detection P-values. Using a detection P-value threshold of P = 0.01, the maximum sample failure rate was 0.00383 (sample ID = 9007225117_R06C01; Figure S8 in Additional file 1). This sample also appears as an outlier on principal component analysis plots of array-wide methylation (Figure S9 in Additional file 1) and was removed from all subsequent analysis; n = 123 samples remained. We also removed 6,600 probes that failed in one or more samples (using the same detection P-value threshold), leaving 467,264 probes for subsequent analysis. Additional quality control checks (overall signal intensity (M + U) across samples, M-value distributions and multi-dimensional scaling plot excluding X and Y probes) revealed no further issues.
Color adjustment, probe-type bias correction and inter-sample quantile normalization
Standard pre-processing adjustments as described in [42] were applied as follows: a) correction for probe color bias (lumi col adj), b) inter-sample quantile normalization (lumi QN), and c) probe type correction using beta mixture quantile dilation (BMIQ) [43]
Batch effect correction and evaluation of alternative quality control pipelines
The following alternative QC pipelines with and without correction for batch covariates (sample plate, sample slide, sample position on slide) were considered:
-
p1A: col adj + QN + BMIQ + batch correction (using ComBat [44])*
-
p1B: col adj + QN + BMIQ + batch correction, adjusting for SoC (using ComBat [44])*
-
p2: col adj + BMIQ + adjust for batch covariates on locus-by-locus basis**
-
p3: col adj + QN + BMIQ adjust for batch covariates on locus-by-locus basis**
-
p4: col adj + QN + BMIQ (no adjustment for batch covariates)
-
p5: col adj + BMIQ (no adjustment for batch covariates)
*Batch correction using ComBat can be applied with or without adjusting for the variable of interest (in our case SoC).
**For p2 and p3, batch effects are first estimated in a linear multiple regression including batch covariates and SoC. Adjusted data are then the residual variation after adjusting for estimated batch effects only.
The performance of each pipeline was evaluated by comparing 450k beta values with percentage methylation estimates obtained from pyrosequencing data at five ME loci overlapping 450k CpG probes (Figure S3 in Additional file 1). Each pipeline was further evaluated by comparing 450k beta values between three pairs of technical replicates on the 450k array. Bland-Altman plots [45] illustrating replication performance across the full range of percentage methylation for each of three technical replicates using ‘optimal’ QC pipeline (p4) are presented in Figure S10 in Additional file 1.
Summary of QC pipeline evaluation
-
The p4 (col adj + QN + BMIQ) and p5 (col adj + BMIQ) approaches gave the best performance when comparing 450k and pyrosequencing methylation estimates at overlapping loci.
-
Neither of these methods (p4 or p5) feature direct batch adjustment, although p4 does remove a substantial amount of batch variation through inter-sample QN.
-
Regression-based locus-by-locus adjustment for batch covariates gives relatively poor 450k versus pyrosequencing correlations.
-
We selected p4 (col adj + QN + BMIQ) as our optimal pipeline. This achieves a Pearson correlation of 0.92 (Spearman R = 0.94) between 450k and pyrosequencing assays, with 88% of beta values differing by less than 15% between the two platforms. These figures compare favorably with a similar analysis [45] performed across 340 CpGs from 4 human breast cancer cell lines (Spearman R = 0.88; 81% with beta difference <15%).
Data-driven estimation of white blood cell counts
Interpretation of analyses investigating DMRs or differentially methylated positions (DMPs) in whole blood should be treated with caution, due to the possibility of confounding by white blood cell (WBC) type. We obtained methylation data-driven estimates of WBC-type composition using the method described by Jaffe and Irizarry [27]. These estimates are stable across technical replicates (Figure S11 in Additional file 1).
There is some evidence for potential confounding by WBC in our data. First, WBC type may be weakly associated with SoC (Figure S4 in Additional file 1). Secondly WBC composition is strongly associated with principal components explaining a large portion of genomewide variation in methylation (Table S6 in Additional file 2). We therefore performed our DMR analysis using bump hunting with adjustment for estimated WBC composition. An independent study [27] identified probes on the 450k array that are differentially methylated according to WBC type. None of the probes mapping to VTRNA2-1 in our study fall within this set (Table S5 in Additional file 2).
Replicate removal and outlier detection
For each of the three technical replicates, we remove the replicate with the highest probe fail rate. One outlier is also removed (see ‘Pre-filtering and quality control’ section above), leaving n = 120 samples for final analysis.
Removal of probes close to SNPs
450k probes within 10 bp of a common African SNP, defined as ‘AFR’ designated polymorphisms with a minor allele frequency >1% using data from the 1000 Genomes Project [46] were removed. We identified 42,435 such probes, so that 424,829 CpGs remain for the bump hunting analysis. SNP filtering was performed using the R ‘Illumina450ProbeVariants.db’ package.
Bump hunting analysis
The bump hunting method [26] measures differential methylation across pre-defined clusters of neighboring CpGs. We used the recommended proximity-based criteria for defining clusters so that each cluster contains a minimum of seven probes on the 450k array, with each probe located within 300 bp of its nearest neighbour. After filtering of SNP-proximal probes (see previous section), this resulted in a total of 10,394 clusters, with a maximum cluster size of 104 CpGs. We performed the bump hunting analysis with methylation pre-adjusted for sex and WBC composition, since the inclusion of adjustment covariates in the bumphunter linear model is not recommended (see [47]). M-values (logit-transformed beta-values) are used in place of untransformed beta-values throughout [48].
The bump hunting method first fits a linear model at each CpG within a cluster, with M-value as the outcome variable and SoC as the predictor variable. A smooth curve (loess) is then fitted to the estimated SoC coefficients across each cluster. Regions where the smoothed estimated coefficients deviate far from zero were considered candidate DMRs. Specifically, these are defined as groups of neighboring probes whose smoothed absolute coefficients exceed the 99th percentile of all estimated coefficients. To form a null distribution that accounts for (a) correlations between probes, (b) differences in cluster size and (c) potential non-normal distribution of model errors, this process is repeated 1,000 times with the outcome variables (M-values) permuted. The permutation P-value for a specific DMR is then the proportion of candidate DMRs across all permutations that had both a larger mean absolute coefficient and a larger length (number of probes) than the empirically observed DMR. We additionally report FWER-adjusted P-values, which are the proportion of permutations having at least one region as extreme as the empirically observed DMR.
Analysis of associations between infant methylation at VTRNA2-1 and season of conception and maternal 1-carbon biomarker concentrations
VTRNA2-1 methylation assayed at three CpG sites by bisulfite pyrosequencing was observed to be highly correlated (mean Spearman R = 0.93). Because percentage methylation values followed a distinct bimodal pattern, we used mean methylation, dichotomized at 40% [28], as our outcome variable. Association of SoC with dichotomized mean methylation was assessed using a Pearson chi-squared test. For testing associations of maternal biomarkers with dichotomized mean methylation, all biomarkers were back-extrapolated to time of conception, adjusted for gestational age, and analyzed in the logarithm, as described previously [10]. Biomarker associations were analyzed in a logistic regression model using the glm function in R.
Data availability
The Gene Expression Omnibus (GEO) accession number for the raw sequence reads for the four Bisulfite-seq libraries is GSE44806. The GEO accession number for the original 450k data sets is GSE59592.