 METHOD
 Open Access
 Published:
Controlling bias and inflation in epigenome and transcriptomewide association studies using the empirical null distribution
Genome Biology volume 18, Article number: 19 (2017)
Abstract
We show that epigenome and transcriptomewide association studies (EWAS and TWAS) are prone to significant inflation and bias of test statistics, an unrecognized phenomenon introducing spurious findings if left unaddressed. Neither GWASbased methodology nor stateoftheart confounder adjustment methods completely remove bias and inflation. We propose a Bayesian method to control bias and inflation in EWAS and TWAS based on estimation of the empirical null distribution. Using simulations and real data, we demonstrate that our method maximizes power while properly controlling the false positive rate. We illustrate the utility of our method in largescale EWAS and TWAS metaanalyses of age and smoking.
Background
The largescale analysis of epigenome and transcriptome data in population studies is thought to answer fundamental questions about genome biology and will be instrumental in linking genetic and environmental influences to disease etiology [1, 2]. Worldwide, research groups are now joining forces to generate and analyze such data [3–7] complementary to the vast resources of genetic data that are already available and have been used successfully in genomewide association studies (GWAS). While the analysis tool box for GWAS has matured, the development of effective methodology for the analysis of epigenome and transcriptomewide association studies (EWAS and TWAS) is a nascent field of research. In an EWAS, DNA methylation levels of typically hundreds of thousands of CpG dinucleotides are individually tested for association with an outcome of interest, while in a TWAS this is done for expression levels of tens of thousands of genes. Currently, EWAS and TWAS analysis heavily relies on approaches specifically designed for GWAS. However, epigenome and transcriptome data are crucially different from genetic data. They are quantitative measures (and not discrete like genotypes) that are subject to major confounding effects of technical batches and biological influences, including cellular heterogeneity [2, 8]. Furthermore, molecular phenotypes such as DNA methylation and gene expression often show stronger associations with phenotypic traits or complex diseases than genotypic markers.
A key aspect of the analysis of omewide association studies is the control of teststatistic inflation. Inflation of test statistics leads to an overestimation of the level of statistical significance and dramatically increases the number of false positive findings [9]. This has always been a major concern in GWAS, but inflated test statistics are also observed in EWAS [10, 11]. Often the level of inflation exceeds that observed in GWAS, yet it is generally not corrected [12]. In GWAS, teststatistic inflation is commonly addressed using genomic control in which the inflated test statistics are divided by the genomic inflation factor. The genomic inflation factor estimates the amount of inflation by comparing observed test statistics across all genetic variants to those expected under the hypothesis of no effect [9]. Recent work pointed out crucial limitations of genomic control in GWAS [13, 14]. Notably, the genomic inflation factor was shown to provide an invalid estimate of teststatistic inflation when the outcome of interest is associated with many, small genetic effects [13]. In EWAS and TWAS, this is the rule rather than the exception. Moreover, test statistics may not only be subject to inflation but also to bias [15], which is not corrected for when using genomic control. Bias of test statistics leads to a shift in the distribution of effect sizes and is driven by confounding [16, 17], a prominent feature of EWAS and TWAS but much less of a concern in GWAS [18]. Thus, this calls for the development of new methods specifically designed to address teststatistic inflation and bias in EWAS and TWAS analyses.
Although generally ignored, genomic control will overestimate the actual inflation unless it is estimated on the basis of genetic variants not associated with the outcome of interest [9, 19]. A Bayesian outlier model [20] was proposed to solve this issue; it estimates inflation while assuming a fixed and small number of 10 associated genetic variants. Although this is an improvement for GWAS with few associations, it will not be sufficient to solve the overestimation of inflation in EWAS and TWAS, which typically yield substantially more associations. Nor does it address the occurrence of teststatistic bias. In the statistical literature, alternative methods have been proposed in the context of largescale multiple hypothesis testing where an empirical null distribution is used for inference [16, 21–23]. The utility of these approaches in EWAS and TWAS, however, remains to be evaluated.
Here, we use simulation studies and largescale methylome (n=2203) and transcriptome (n=1910) data [24, 25] to show that correcting inflated test statistics by applying genomic control is too conservative for EWAS and TWAS and that teststatistic bias cannot be ignored. Moreover, we demonstrate that teststatistic bias and inflation are represented by the mean and standard deviation of the empirical null distribution and propose a Bayesian method for its estimation. Application of stateoftheart batch correction methods, including SVA [26], RUV [27], and CATE [17], were not able to remove all teststatistic bias and inflation. Hence, the resulting test statistics require empirical calibration to achieve optimal statistical power while controlling the number of false positives at the desired level. We develop a Bayesian method for estimation of the empirical null distribution and propose a bias and inflation correction implemented as an R/Bioconductor [28, 29] package BACON. Finally, we show the utility of our method by performing an EWAS and TWAS metaanalysis of two commonly studied outcomes: age and smoking status.
Results
The genomic inflation factor is not suitable to measure inflation in EWAS/TWAS
We performed an EWAS and TWAS of age and smoking status using subsets of 500 individuals from two population cohorts, namely the Leiden Longevity Study (LLS) and LifeLines (LL) (Additional file 1: Table S1). The analyses were adjusted for known technical and biological covariates (including measured white blood cell counts) within a linear model framework. Inflation of test statistics was observed in all of the eight analyses (two cohorts, two data types, and two outcomes; Fig. 1). The amount of inflation estimated using the commonly used genomic inflation factor [9] varied substantially across analyses and ranged from 1.33 to 1.72 for the EWAS and from 1.21 to 1.54 for the TWAS (Fig. 1).
However, the genomic inflation factor appeared to be correlated with the expected number of true associations. For example, the genomic inflation factor was higher for age than smoking status, and previous studies showed that age is associated with many more differentially methylated sites and differentially expressed genes than smoking status [3–7]. For the analysis of age, the genomic inflation factor was higher for LL than LLS, which can be attributed to the higher statistical power for LL (age range 21 years) than LLS (age range 9 years).
A simulation study substantiated the impression that the genomic inflation factor depends on the number of true associations (Fig. 2). In fact, this result can be derived mathematically [9]. We conclude that the genomic inflation factor commonly overestimates the true level of teststatistic inflation in EWAS and TWAS.
EWAS/TWAS not only suffer from inflation but also from teststatistic bias
While quantilequantile plots of expected versus observed test statistics, or their corresponding P values, are frequently used to visualize inflation (Fig. 1), the alternative representation through a histogram of test statistics reveals a second artifact, namely a bias of the test statistics (Fig. 3 a and Additional file 1: Figure S1). This bias is visible as a deviation of the mode of the observed test statistics from zero, which is the mode of the standard normal distribution. Since the majority of features (being genetic variants, CpGs, or genes) are assumed not to be associated with the outcome of interest, test statistics obtained from a linear model should follow a standard normal distribution (i.e., centered at zero). We observed teststatistic bias in the EWAS and TWAS of age and smoking irrespective of cohort and outcome (Additional file 1: Figure S1). Genomic control does not address bias because it uses a normal distribution with the mode fixed at zero (Additional file 2). The misspecification of the observed distribution of test statistics by genomic control is illustrated in Fig. 3 c. Note that even permutationbased approaches, which are often assumed to rescue violations of assumptions regarding the theoretical null distribution, do not result in a proper null distribution, and both teststatistic bias and inflation persist [16, 30] (Fig. 3 d). We mathematically derived that unobserved confounding factors introduce bias in the analysis of highdimensional data (Additional file 2), thus expanding on earlier work by Rao [15].
Estimating teststatistic bias and inflation
Both bias and inflation represent deviations from the theoretical null distribution: bias (i.e., mean), a deviation from zero, and inflation (i.e., standard deviation) (Additional file 2). Hence, estimating the amount of bias and inflation is identical to estimating the parameters of the empirical null distribution. We developed a Bayesian method to estimate the empirical null distribution from an observed set of test statistics and thus simultaneously obtain estimates of bias and inflation. The method fits a threecomponent normal mixture to the observed set of test statistics using a Gibbs sampling algorithm [31]. One component reflects the null distribution with mean and standard deviation representing bias and inflation. The other two components with a positive and a negative mean capture the fraction of true associations observed in the data, which is assumed to be an unknown minority of tests (Fig. 3 b, Fig. 4, and Additional file 1: Figure S2). Hence, our method simultaneously provides estimates for the amount of bias and inflation without being affected by an unknown proportion of true associations (Additional file 1: Figure S3). We compared our method to alternative approaches for estimation of the empirical null distribution [16] in a simulation study. This showed that the performance of our method is equal to or better than those of the previous methods under various scenarios. Moreover, our method resulted in the most stable estimation of the inflation, which suggests that other methods randomly over or underestimate the level of inflation (Additional file 1: Figure S4 and Additional file 3).
Correction for unobserved covariates reduces teststatistic bias and inflation
The primary causes of inflation and bias are thought to be unmeasured technical and biological confounding [8, 16], e.g., population substructure, batch effects, and cellular heterogeneity. Various methods have been developed to reduce the impact of these unmeasured factors in highdimensional data [17, 26, 27, 32–34]. We applied six of these methods to adjust an EWAS and TWAS of age in 500 individuals, a subset of the LLS cohort, and investigated their impact on teststatistic bias and inflation. All approaches reduced the amount of bias and inflation as compared with a model using known covariates only (Table 1 and Additional file 1: Table S2). Nevertheless, residual bias and inflation were observed. Therefore, we designed a twostage method in order to preserve statistical power while appropriately controlling the number of false positives. First, we performed an analysis that corrects for known biological and technical covariates plus estimated unobserved covariates, followed by estimating and adjusting the residual bias and inflation using the empirical null distribution. In the adjustment step, P values are calculated using the empirical null distribution instead of the standard normal or the inflated normal that is used by the genomic control method. A complication of the genomic inflation factor is that it estimates the variance of the null distribution (\(\lambda _{{\chi ^{2}_{1}}}\)), whereas the standard deviation (\(\sqrt {\lambda _{{\chi ^{2}_{1}}}}\)) is required for genomic control on normally distributed test statistics resulting from linear models with a continuous outcome (here, DNA methylation and gene expression data). Furthermore, it is important to note that bias not only results in incorrect test statistics and P values but also results in biased estimates of effect sizes. To evaluate the performance of the twostage method, we conducted a numerical simulation. To account for unmeasured confounding, we selected CATE, a stateoftheart method that was shown to have superior performance in estimating unobserved covariates as compared with alternative methods [17]. Our Bayesian method in combination with CATE yielded the highest power with the fraction of false positives close to the nominal level (0.058±0.0052). In contrast, methods that ignore unobserved covariates led to high false positive rates and methods that use genomic control resulted in low power (Table 2). Also, the teststatistic calibration that has been proposed to use in combination with CATE [17] was conservative, resulting in low power, which is in line with the fact that this method is closely related to genomic control.
In addition to confounding, correlation between features (i.e., CpGs and genes) may cause teststatistics inflation or bias. A second simulation study showed that if test statistics are correlated, our Bayesian method properly controls the false positive rate while preserving power (Table 3). Again, the application of genomic control is too conservative (Table 3 and Additional file 3).
Fixedeffect metaanalysis with control for bias and inflation
A main development in the field of EWAS and TWAS, analogous to current practice in GWAS, is the combined analysis of multiple population studies to detect an increasing number of associations including those with small effect sizes. Fixedeffect metaanalysis combines estimated effect sizes and their standard errors from different studies to construct pooled estimates resulting in higher precision of effectsize estimates and hence superior statistical power [35, 36]. We performed an EWAS and TWAS of age and smoking status in four cohorts totaling 2203 individuals with methylome and 1910 individuals with transcriptome data, respectively. We combined the results from the four cohorts through fixedeffect metaanalysis (Table 4 and Additional file 1: Figure S5). As observed earlier, bias and inflation remained present after addressing unmeasured confounding using CATE. Also estimates of inflation using genomic control were both higher and considerably more variable across analyses and cohorts than the estimates obtained using our Bayesian method (Table 4). The Bayesian method fully removed all bias and inflation. Critically, bias (< 0.03) and inflation (< 1.14) remained minimal in the metaanalysis as compared with a metaanalysis using genomic control (Table 4). The latter contrasts to approaches in which inflation is not addressed at all or those using genomic control: both can result in high levels of inflation and bias in the metaanalysis that often are considerably higher than in the individual cohorts. The top hits identified for age and smoking included those consistently reported in previous studies [3–7]. Furthermore, the simultaneous performance of an EWAS and TWAS in a largescale metaanalysis showed a remarkable overlap in results between the two study types of 410 and 57 genes for age and smoking, respectively (assigning the nearest gene to a CpG site) (Additional files 4–7: Tables S3ad). For example, both DNA methylation near and expression of CD248, DNMT3A, and FBLN2 were associated with age (Fig. 5 a), while the same was true for GPR15, AHRR and CLDND1 for smoking (Fig. 5 b). In total 15,967 (3.5%) CpG sites and 1020 (2.7%) genes were significantly associated with age (Bonferronicorrected P values < 0.05). For smoking, the number of associated CpGs and genes were 1128 (0.25%) and 301 (0.80%), respectively.
We implemented our Bayesian method as an R/Bioconductor [28, 29] package BACON. BACON provides valid estimates of bias and inflation in largescale analyses including EWAS and TWAS, yields corrected test statistics, and supports the streamlined application of the method to fixedeffect metaanalyses.
Discussion and conclusion
We describe a novel Bayesian method to detect and correct for bias and inflation in epigenome and transcriptomewide association studies. Our method has the crucial characteristic that it is largely independent of the fraction of true associations in the data. We showed that the application of genomic control results in spurious associations because it does not address bias and, moreover, reduces power because it is sensitive to the number of true associations and thus commonly overestimates the levels of inflation. The performance of our method towards estimating the empirical null distribution of test statistics outperforms existing methods [16] by taking advantage of prior knowledge of the distribution and the composition of test statistics.
Methods that try to estimate unmeasured covariates [17, 26, 27] and those that try to recover the empirical null distribution [16] rely on the same principle. They extract information from features that are assumed not to be associated with the outcome of interest. Methods to estimate unknown covariates (e.g., RUV, SVA, and CATE as we used here) either use negative controls or assume the number of associated features to be sparse and, interestingly, they can be unified in a single mathematical framework [17]. Genomic control [9] yields a valid estimate of the inflation factor when calculated from features that are known not to be associated with the phenotype of interest. Similarly, the estimation of the empirical null distribution requires that the vast majority of features follow the null distribution [16]. Our Bayesian method is designed to be flexible in dealing with larger fractions of true associations, which turns out to be crucial in particular for EWAS and TWAS metaanalyses.
Our work extends the work of Devlin and Roeder [9], who originally propose to use genomic control to tackle teststatistic inflation for GWAS, and links their method to the pioneering work of Efron [16] on estimating an empirical null distribution for highdimensional data inference. Hence, although specifically applied to EWAS and TWAS, our statistical method may have implications for any field focusing on statistical inference for highdimensional data, whether it be omics types or imaging data.
Our method of estimating bias and inflation may resolve a common inconsistency in the current analysis of EWAS and TWAS. While it is becoming the norm to report inflation factors calculated using the traditional genomic control approach, inflation is rarely actually dealt with in the analysis, presumably because this is deemed to be too conservative. However, inflation may be substantial, in particular in a metaanalysis, and current practice is bound to introduce false positive findings. We show that estimating the inflation factor using the genomic inflation factor results both in an overestimation of the actual inflation (i.e., it is indeed conservative) and in imprecise estimates contributing to the previously unexplained, high variability across studies. Our method provides a realistic estimate of inflation that does not suffer from a high variability. Moreover, our method is the first to address the previously unrecognized issue of bias in test statistics. In conclusion, our method optimally reduces the number of false positive findings while preserving statistical power and can be seamlessly incorporated into existing workflows for the analysis of EWAS, TWAS, and other omics data.
Methods
Data sets
DNA methylation data and RNAseq data were generated within the Biobankbased Integrative Omics Studies Consortium (http://wiki.bbmri.nl/wiki/BIOS_start). The data comprise four biobanks: Cohort on Diabetes and Atherosclerosis Maastricht (CODAM, n ≈180) [37], LifeLines (LL, n ≈700) [38], the Leiden Longevity Study (LLS, n ≈600) [39], and the Rotterdam Study (RS, n ≈600) [40]. Sample identity of DNA methylation and gene expression data was confirmed using genotype data. Both RNAseq fastq files and DNA methylation idat files are available from the European Genomephenome Archive (EGA) under accession number [EGA:EGAC00001000277] together with phenotypes and measured cell counts used in these analyses. Data were generated by the Human Genotyping facility (HugeF) of ErasmusMC, the Netherlands (www.glimDNA.org).
RNAseq data preprocessing
A detailed description of the RNAseq data processing can be found in Zhernakova et al. [24]. Briefly, total RNA from whole blood, depleted of globin transcripts, was sequenced (2×50bp) using the Illumina HiSeq 2000 platform, and read alignment was performed using STAR (v2.3.0). Subsequently, RNAseq counts were normalized using TMM [41] and transformed to log2 counts per million. Genes that yielded zero counts for all samples across cohorts were removed, which resulted in 45,867 genes (ENSEMBLv73). For all analyses, genes with the lowest overall variance were excluded (5% lowest).
450K DNA methylation data preprocessing
The generation of genomewide DNA methylation data is described by Bonder et al. [25]. Briefly, 500 ng of genomic DNA was bisulfite modified using the EZ DNA Methylation kit (Zymo Research, Irvine, CA, USA) and hybridized on Illumina 450K arrays according to the manufacturer’s protocols. The original idat files were generated by the Illumina iScan BeadChip scanner. Subsequently, sample quality control was performed using MethylAid [42]. Ambiguously mapped probes [43], probes with a high detection P value (> 0.01), probes with a low bead count (< 3 beads), and probes with a low success rate (missing in > 95% of the samples) were set to missing. Samples containing an excess of missing probes (> 5%) were excluded from the analysis. Subsequently, per cohort, imputation [44] was performed to impute the missing values. Functional normalization [45], as implemented in the minfi package [46], was used per cohort. All analyses were performed on M values. Detailed description of the 450K DNA methylation preprocessing steps are available from the gitrepo Leiden450K [47].
White blood cell count prediction
White blood cell counts (WBC), i.e., neutrophils, lymphocytes, monocytes, eosinophils, and basophils, were measured by the standard WBC differential as part of the complete blood count (CBC). A minority of samples were lacking CBC measurements. Since DNA methylation levels are informative of the white blood cell composition [48], we build a linear predictor to infer the white blood cell composition of those samples lacking WBC measurements (Additional file 2). Predicted cell counts were used in the metaanalysis. For the analyses of the cohort subsets, individuals with measured cell counts were selected.
Association analyses
All association analyses were performed using limma’s lmFit function [49]. Since the sample sizes of our data were all above > 100, the empirical Bayes step was skipped. T test statistics were transformed to P values using a standard normal distribution. For the analysis of RNAseq data, we first applied a voomtransformation [50] on the TMMnormalized counts while controlling for known covariates including age, gender, smoking status measured cell counts, and a technical covariate introducing a batch effect (the flowcell identifier of the sequencing machine). For the analysis of DNA methylation data, the functional normalized betavalues [45] were transformed to M values, and again lmFit was used to obtain test statistics for the covariate of interest. Here we included age, gender, smoking status, measured cell counts, and array position as known covariates.
Genomic control and the genomic inflation factor
The genomic inflation factor as originally proposed by Devlin and Roeder [9] is the ratio of the median of a set of trendtest statistics (i.e., obtained by the Armitage’s trend test that follows under the null hypothesis of no association a χ ^{2}distribution with one degree of freedom) divided by the theoretical median, . For example, let w _{1},w _{2},⋯,w _{ p } be a set of p test statistics, following a \({\chi _{1}^{2}}\)distribution with one degree of freedom; the following estimator was proposed to quantify the amount of inflation:
Furthermore, it was proposed to control the inflated test statistics by dividing the test statistics by the estimated amount of inflation; this approach is referred to as genomic control [9].
In EWAS/TWAS test statistics are usually obtained from inference on the coefficients of linear regression models (instead of a trend test), i.e., ttest statistics that can be assumed approximately to follow a standard normal distribution (instead of a χ ^{2}distribution). Therefore, applying genomic control to these test statistics entails dividing by the square root of the genomic inflation factor, \(\sqrt {\lambda _{{\chi ^{2}_{1}}}}\) (instead of \(\lambda _{{\chi ^{2}_{1}}}\)).
Estimation of the unobserved covariates
To investigate whether adding estimated unobserved covariates reduces bias and inflation, we performed EWAS/TWAS with (1) only the covariate of interest, (2) known covariates (e.g., white blood cell counts), and either (3) known covariates plus one, (4) plus two, or (5) plus three principal components estimated from the data, and (6) known covariates with estimated unobserved covariates using CATE [17]. For TWAS, we additionally used RUV [27, 32] and SVA [26]. For EWAS, iSVA [33] and RUVm [34] were used. All algorithms were used with default parameters except for CATE, which was run using calibrate=FALSE.
Simulation studies
The impact of true association on the genomic inflation factor
One hundred sets of 2000 test statistics were generated from a normal mixture distribution with different mixture coefficients (0.8, 0.90, and 0.95). The majority of the null test statistics were drawn from a standard normal, N(0,1), while the alternative test statistics were drawn from a normal distribution, N(μ,1), with μ∼N(0,3). An equal number of positive and negative associations were simulated. For each set of test statistics, inflation factors were calculated to investigate the impact of the number of true associations (Additional file 3). Additional file 3 shows the performance and robustness of BACON in estimating the empirical null distribution when different data generating approaches are used.
Comparing different methods that estimate the empirical null distribution
Efron proposed two methods for estimation of the empirical null distribution from a set of test statistics [16]. In order to compare the performance of those methods with our Bayesian method, sets of test statistics were generated, similar to the approach described above, but under different scenarios: scenario “equal” with equal proportion of positive and negative associations (0.05, 0.05), scenario “skewed” with only positive associations (prop. 0.1), scenario “small” similar to scenario equal with only 0.01 proportion of true associations, and scenario “close” where the distribution for the means had expected value of 1 (instead of 3). For each scenario, 2000 test statistics were generated 100 times. To estimate the empirical null distributions as proposed by Efron, we used the locfdr R package. For both methods, maximum likelihood and moment matching, default parameter settings were used (Additional file 3).
Simulation with unobserved confounding factors
We used the simulation setup of Wang et al. [17] to generate data with confounding factors. Briefly, data Y _{ n×p } for n=100 samples and p=2000 features were generated according to the following model: Y _{ n×p }=X _{ n×1} β ^{T}+Z _{ n×r } γ ^{T}+E, where Z, represents the r=5 unobserved confounding factors model as ZX=X α ^{T}+D, with α representing the strength of confounding. Furthermore, a continuous covariate of interest, X, was sampled from the normal distribution. Effects were introduced by fixing 90% of the βs at zero while the remaining were different from zero. Both E and D represent Gaussian noise. A detailed description of the simulation setup is given by Wang et al. and is available as an R function gen.sim.dat from the package CATE (Additional file 3: section 5).
Simulation with correlated test statistics
Correlated test statistics were generated according to the approach of Efron [51] introducing a blockcorrelation structure among test statistics. The uncorrelated test statistics with effects generated from the normal mixture were added to the test statistics with blockcorrelation structure (Additional file 3: section 3.3 and section 5). The same number of repeated simulations, 100, number of test statistics, 2000, and proportion of null features, 0.9 were used.
The Gibbs sampler
We assume the observed set of test statistics can be modeled by a threecomponent normal mixture:
with 9−1 parameters (the mixture proportions are constrained to sum to one, \(\sum _{j=1}^{3}\epsilon _{j} = 1\)), and ϕ(x;μ _{ j },σ _{ j }) being the density of \(\mathcal {N}(\mu _{j}, {\sigma _{j}^{2}})\). Furthermore, one component represents the empirical null distribution with its estimated mean (i.e., bias) and standard deviation (i.e., inflation). We propose to use a Gibbs sampling algorithm [31, 52, 53] to estimate the parameters of the mixture distribution.
Conjugate prior distributions are used for the means, μ _{ j }, variances, \({\sigma _{j}^{2}}\), and mixture proportions, ε _{ j }. Hence, we assume a normal distribution, , for the means, an inverse gamma distribution, \({\sigma ^{2}_{j}} \sim \mathcal {IG}(\alpha _{j}, \beta _{j})\), for the variances, and a Dirichlet distribution, \((\epsilon _{1}, \epsilon _{2}, \epsilon _{3}) \sim \mathcal {D}(\gamma _{1}, \gamma _{2}, \gamma _{3})\), for the mixture proportions. Well chosen hyperpriors ensure that the occurrence of labeling switching is minimized; i.e., during sampling from the posterior, the null component is switched with one of the alternative components. That is, we take informative hyperpriors for means, the null component, λ _{1}=0, and for the alternative components λ _{2}=−3 and λ _{3}=3 all τ’s are equal to 100. The hyperpriors for the variance parameters are equal for all components α=1.28 and β=0.36 and were taken from Raftery [54]. For the Dirichlet distribution, widely used uniform noninformative prior parameters were chosen: γ _{1}=γ _{2}=γ _{3}=1. Furthermore, datadependent starting values are used to start the algorithm at a good initial point. These are based on the median and median absolute deviation (MAD) of the test statistics. A burnin period of 3000 iterations was used as well as 2000 subsequent samples to estimate the parameters of the mixture distribution using the mean.
Given test statistics x _{ i } (zscores or transformed to zscores) for i=1,⋯,p, prior distributions with hyperparameters, and starting values for the posterior distributions, the Gibbs sampling algorithm is run in the following way:
Iterate for t=1,⋯,5000,

1.
Generate the missing (unobserved) data: \(z_{ij} \sim \mathcal {M}(\tilde {p}_{ij})\) from a multinomial distribution, with parameter p _{ ij }=ε _{ j } ϕ(x _{ i };μ _{ j },σ _{ j }), \(\tilde {p}_{ij}\) represents the normalized proportion \(\left (\sum _{j=1}^{3} = \tilde {p}_{ij} = 1\right)\).

2.
Obtain and

3.
Generate samples from the posteriors according to:
$$ {}\begin{aligned} & \epsilon_{j} \sim \mathcal{D}(\gamma_{j}+n_{j}),\\ &\mu_{j}{\sigma_{j}^{2}} \sim \mathcal{N}\left(\frac{\lambda_{j}\tau_{j} + s_{j}}{n_{j} + \tau_{j}}, \frac{{\sigma_{j}^{2}}+s_{j}}{n_{j} + \tau_{j}}\right),\\ &\sigma^{2}_{j} \sim \Gamma \left(\alpha\,+\,\frac{1}{2}(n_{j}+1), \left(\beta + \frac{1}{2}\tau_{j}(\mu_{j}\lambda_{j})^{2} \,+\, \frac{1}{2}{s_{j}^{2}}\right)^{1}\right). \end{aligned} $$(3)
The latter mimics sampling from an inverse gamma distribution. For clarity, an iteration superscript is omitted. We assume that 3000 iterations (burnin period) are sufficient for the Markov properties to hold and that the samples from the conditional distributions can be assumed to be samples from the joint parameter distribution. We implemented the Gibbs sampling algorithm in C and can either use weighted multinomial sampling method for binned test statistics or a fast sampling method [55] if all individual test statistics are used (userdefined). Optionally, test statistics following a distribution different from the normal distribution can be used by transforming them to zscores. For example, test statistics w _{1},⋯,w _{ p } that follow under the null hypothesis a χ ^{2}distribution with ν degrees of freedom can be transformed to zscores using \(\Phi ^{1}(F_{\chi ^{2}_{\nu }}(w_{i}))\) [16] (Additional file 3).
References
 1
Rakyan VK, Down TA, Balding DJ, Beck S. Epigenomewide association studies for common human diseases. Nat Rev Genet. 2011; 12(8):529–41.
 2
Mill J, Heijmans BT. From promises to practical strategies in epigenetic epidemiology. Nat Rev Genet. 2013; 14(8):585–94.
 3
de Magalhaes JP, Curado J, Church GM. Metaanalysis of agerelated gene expression profiles identifies common signatures of aging. Bioinformatics. 2009; 25(7):875–81.
 4
Peters MJ, et al. The transcriptional landscape of age in human peripheral blood. Nat Commun. 2015; 6:8570.
 5
Hannum G, et al. Genomewide methylation profiles reveal quantitative views of human aging rates. Mol Cell. 2013; 49(2):359–67.
 6
Beineke P, et al.A whole blood gene expressionbased signature for smoking status. BMC Med Genomics. 2012; 5:58. doi:10.1186/17558794558.
 7
Gao X, Jia M, Zhang Y, Breitling LP, Brenner H. DNA methylation changes of whole blood cells in response to active smoking exposure in adults: a systematic review of DNA methylation studies. Clin Epigenetics. 2015; 7:113. doi:10.1186/s1314801501483.
 8
Leek JT, et al.Tackling the widespread and critical impact of batch effects in highthroughput data. Nat Rev Genet. 2010; 11(10):733–9.
 9
Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999; 55(4):997–1004.
 10
Lehne B, et al.A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenomewide association studies. Genome Biol. 2015; 16:37. doi:10.1186/s130590150600x.
 11
Zou J, Lippert C, Heckerman D, Aryee M, Listgarten J. Epigenomewide association studies without the need for celltype composition. Nat Methods. 2014; 11(3):309–11.
 12
Joubert BR, et al.DNA Methylation in Newborns and Maternal Smoking in Pregnancy: Genomewide Consortium Metaanalysis. Am J Hum Genet. 2016; 98(4):680–96.
 13
Yang J, et al.Genomic inflation factors under polygenic inheritance. Eur J Hum Genet. 2011; 19(7):807–12.
 14
Voorman A, Lumley T, McKnight B, Rice K. Behavior of QQplots and genomic control in studies of geneenvironment interaction. PLoS ONE. 2011; 6(5):19416.
 15
Rao P. Some notes on misspecification in multiple regression. Am Stat. 1971; 25(5). doi:10.2307/2686082.
 16
Efron B. Largescale simultaneous hypothesis testing: the choice of a null hypothesis. JASA. 2004; 99(465):96–104.
 17
Wang J, Zhao Q, Hastie T, Owen AB. Confounder adjustment in multiple hypothesis testing. arXiv:1508.04178. 2015.
 18
Devlin B, Roeder K, Wasserman L. Genomic control, a new approach to geneticbased association studies. Theor Popul Biol. 2001; 60(3). doi:10.1006/tpbi.2001.1542.
 19
Devlin B, Bacanu SA, Roeder K. Genomic Control to the extreme. Nat Genet. 2004; 36(11):1129–30.
 20
Verdinelli I, Wasserman L. “Bayesian analysis of outlier problems using the Gibbs sampler”. Stat Comput. 1991; 1. doi:10.1007/BF01889985.
 21
Efron B. Size, power and false discovery rates. Ann Stat. 2007; 34(4). doi:10.1214/009053606000001460.
 22
Schwartzman A. Empirical null and false discovery rate inference for exponential families. Ann Appl Stat. 2008; 2(4). doi:10.1214/08AOAS184.
 23
Schuemie MJ, Ryan PB, DuMouchel W, Suchard MA, Madigan D. Interpreting observational studies: why empirical calibration is needed to correct pvalues. Stat Med. 2014; 33(2):209–18.
 24
Zhernakova DV, et al.Identification of contextdependent expression quantitative trait loci in whole blood. Nat Genet. 2016. doi:10.1038/ng.3737.
 25
Bonder MJ, et al.Disease variants alter transcription factor levels and methylation of their binding sites. Nat Genet. 2016. doi:10.1038/ng.3721.
 26
Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007; 3(9):1724–35.
 27
GagnonBartsch JA, Speed TP. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012; 13(3):539–52.
 28
Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, R, Vienna, Austria; 2015. http://www.Rproject.org/.
 29
Huber W, et al. Orchestrating highthroughput genomic analysis with Bioconductor. Nat Methods. 2015; 12(2):115–21.
 30
Kerr KF. Comments on the analysis of unbalanced microarray data. Bioinformatics. 2009; 25(16):2035–41.
 31
Diebolt J, Robert CP. Estimation of finite mixture distributions through Bayesian sampling. JRSS B. 1994; 56(2):363–75.
 32
Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNAseq data using factor analysis of control genes or samples. Nat Biotechnol. 2014; 32(9):896–902.
 33
Teschendorff AE, Zhuang J, Widschwendter M. Independent surrogate variable analysis to deconvolve confounding factors in largescale microarray profiling studies. Bioinformatics. 2011; 27(11):1496–505.
 34
Maksimovic J, GagnonBartsch JA, Speed TP, Oshlack A. Removing unwanted variation in a differential methylation analysis of Illumina HumanMethylation450 array data. Nucleic Acids Res. 2015; 43(16):106.
 35
Thompson JR, Attia J, Minelli C. The metaanalysis of genomewide association studies. Brief Bioinformatics. 2011; 12(3):259–69.
 36
van Houwelingen HC, Arends LR, Stijnen T. Advanced methods in metaanalysis: multivariate approach and metaregression. Stat Med. 2002; 21(4):589–624.
 37
van Greevenbroek MM, et al.The crosssectional association between insulin resistance and circulating complement C3 is partly explained by plasma alanine aminotransferase, independent of central obesity and general inflammation (the CODAM study). Eur J Clin Invest. 2011; 41(4):372–9.
 38
Tigchelaar EF, et al. Cohort profile: LifeLines DEEP, a prospective, general population cohort study in the northern Netherlands: study design and baseline characteristics. BMJ Open. 2015; 5(8):006772.
 39
Westendorp RG, et al.Nonagenarian siblings and their offspring display lower risk of mortality and morbidity than sporadic nonagenarians: The Leiden Longevity Study. J Am Geriatr Soc. 2009; 57(9):1634–37.
 40
Hofman A, et al.The Rotterdam Study: 2012 objectives and design update. Eur J Epidemiol. 2011; 26(8):657–86.
 41
Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010; 11(3):25.
 42
van Iterson M, et al.MethylAid: visual and interactive quality control of large Illumina 450k datasets. Bioinformatics. 2014; 30(23):3435–7.
 43
Chen YA, et al.Discovery of crossreactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics. 2013; 8(2):203–9.
 44
Troyanskaya O, et al.Missing value estimation methods for DNA microarrays. Bioinformatics. 2001; 17(6):520–5.
 45
Fortin JP, et al.Functional normalization of 450k methylation array data improves replication in large cancer studies. Genome Biol. 2014; 15(12):503.
 46
Aryee MJ, et al.Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics. 2014; 30(10):1363–9.
 47
van Iterson M. Quality control, probe/sample filtering and normalization of Infinium HumanMethylation450 BeadChip data: ’The Leiden Approach’. 2016. doi:10.5281/zenodo.158908.
 48
Houseman EA, et al.DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012; 13:86.
 49
Ritchie ME, et al.limma powers differential expression analyses for RNAsequencing and microarray studies. Nucleic Acids Res. 2015; 43(7):47.
 50
Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014; 15(2):29.
 51
Efron B. Correlation questions In: Cox HMHambly, editor. Largescale inference. New York: Cambridge University Press: 2010. p. 141–62.
 52
Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell. 1984; 6(6):721–41.
 53
Casella G, George EI. Explaining the Gibbs sampler. Am Stat. 1992; 46(3). doi:10.2307/2685208.
 54
Raftery AE. Hypothesis testing and model selection In: Gilks W. R, Richardson S, Spiegelhalter DJ, editors. Markov chain Monte Carlo in practice. London: Chapman and Hall: 1996. p. 163–88.
 55
Efraimidisa PS, Spirakisb PG. Weighted random sampling with a reservoir. Inform Process Lett. 2006; 97(6). doi:10.1016/j.ipl.2005.11.003.
 56
Mevik BH, Wehrens R. The pls Package: Principal Component and Partial Least Squares Regression in R. J Stat Softw. 2007; 18(2). doi:10.18637/jss.v018.i02.
Acknowledgements
Samples were contributed by LifeLines (http://lifelines.nl/lifelinesresearch/general), the Leiden Longevity Study (http://www.leidenlangleven.nl), the Netherlands Twin Registry (http://www.tweelingenregister.org), the Rotterdam Study (http://www.erasmusepidemiology.nl/research/ergo.htm), and the CODAM study (http://www.carimmaastricht.nl/). All analyses were carried out on the Dutch national einfrastructure with the support of SURF Cooperative. The BIOS Consortium comprises Bastiaan T. Heimans, Peter A.C. ’t Hoen, Joyce van Meurs, Rick Jansen, Lude Franke, Dorret I. Boomsma, Rene Pool, Jenny van Dongen, Jouke J Hottenga, Marleen M.J. van Greevenbroek, Coen D.A. Stehouwer, Carla J.H. van der Kallen, Casper G. Schalkwijk, Cisca Wijmenga, Sasha Zhernakova, Ettje F. Tigchelaar, P. Eline Slagboom, Marian Beekman, Joris Deelen, Diana van Heernst, Jan H. Veldink, Leonard H. van den Berg, Cornelia M. van Duijn, Bert A. Hofman, Aaron Isaacs, Andre G. Uitterlinden P. Mila Jhamai, Michael Verbiest, H. Eka D. Suchiman, Marijn Verkerk, Ruud van der Breggen, Jeroen van Rooij, Nico Lakenberg, Hailiang Mei, Maarten van Iterson, Michiel van Galen, Jan Bot, Dasha V. Zhernakova, Peter van ’t Hof, Patrick Deelen, Irene Nooren, Matthijs Moed, Martijn Vermaat, Rene Luijk, Marc Jan Bonder, Freerk van Dijk, Wibowo Arindrarto, Szymon M. Kielbasa, Morris A. Swertz, Erik W. van Zwet, and PeterBram ’t Hoen.
The BIOS consortium The mission of the BIOS Consortium is to create a largescale data infrastructure and to bring together BBMRI researchers focusing on integrative omics studies in Dutch Biobanks (www.bbmri.nl/?p=259).
Participants: Management Team: Bastiaan T. Heijmans (chair)^{1}, Peter A.C. ’t Hoen^{2}, Joyce van Meurs^{3}, Rick Jansen^{5}, Lude Franke^{6}.
Cohort collection: Dorret I. Boomsma^{7}, René Pool^{7}, Jenny van Dongen^{7}, Jouke J. Hottenga^{7} (Netherlands Twin Register); Marleen MJ van Greevenbroek^{8}, Coen D.A. Stehouwer^{8}, Carla J.H. van der Kallen^{8}, Casper G. Schalkwijk^{8} (Cohort study on Diabetes and Atherosclerosis Maastricht); Cisca Wijmenga^{6}, Lude Franke^{6}, Sasha Zhernakova^{6}, Ettje F. Tigchelaar^{6} (LifeLines Deep); P. Eline Slagboom^{1}, Marian Beekman^{1}, Joris Deelen^{1}, Diana van Heemst^{9} (Leiden Longevity Study); Jan H. Veldink10, Leonard H. van den Berg10 (Prospective ALS Study Netherlands); Cornelia M. van Duijn^{4}, Bert A. Hofman^{11}, Aaron Isaacs^{4}, André G. Uitterlinden^{3} (Rotterdam Study).
Data Generation: Joyce van Meurs (Chair)^{3}, P. Mila Jhamai^{3}, Michael Verbiest^{3}, H. Eka D. Suchiman^{1}, Marijn Verkerk^{3}, Ruud van der Breggen^{1}, Jeroen van Rooij^{3}, Nico Lakenberg^{1}.
Data management and computational infrastructure: Hailiang Mei (Chair)^{12}, Maarten van Iterson^{1}, Michiel van Galen^{2}, Jan Bot^{13}, Dasha V. Zhernakova^{5}, Rick Jansen^{4}, Peter van ’t Hof^{12}, Patrick Deelen^{5}, Irene Nooren^{13}, Peter A.C. ’t Hoen^{2}, Bastiaan T. Heijmans^{1}, Matthijs Moed^{1}.
Data analysis group: Lude Franke (CoChair)^{6}, Martijn Vermaat^{2}, Dasha V. Zhernakova^{6}, René Luijk^{1}, Marc Jan Bonder^{6}, Maarten van Iterson^{1}, Patrick Deelen^{6}, Freerk van Dijk^{14}, Michiel van Galen^{2}, Wibowo Arindrarto^{12}, Szymon M. Kielbasa^{15}, Morris A. Swertz^{14}, Erik. W van Zwet^{15}, Rick Jansen^{5}, PeterBram ’t Hoen (CoChair)^{2}, Bastiaan T. Heijmans (CoChair)^{1}.
Institutes involved: (1) Molecular Epidemiology Section, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands
(2) Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
(3) Department of Internal Medicine, ErasmusMC, Rotterdam, The Netherlands
(4) Department of Genetic Epidemiology, ErasmusMC, Rotterdam, The Netherlands
(5) Department of Psychiatry, VU University Medical Center, Neuroscience Campus Amsterdam, Amsterdam, The Netherlands
(6) Department of Genetics, University of Groningen, University Medical Centre Groningen, Groningen, The Netherlands
(7) Department of Biological Psychology, VU University Amsterdam, Neuroscience Campus Amsterdam, Amsterdam, The Netherlands
(8) Department of Internal Medicine and School for Cardiovascular Diseases (CARIM), Maastricht University Medical Center, Maastricht, The Netherlands
(9) Department of Gerontology and Geriatrics, Leiden University Medical Center, Leiden, The Netherlands
(10) Department of Neurology, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht, The Netherlands
(11) Department of Epidemiology, ErasmusMC, Rotterdam, The Netherlands
(12) Sequence Analysis Support Core, Leiden University Medical Center, Leiden, The Netherlands
(13) SURFsara, Amsterdam, the Netherlands
(14) Genomics Coordination Center, University Medical Center Groningen, University of Groningen, Groningen, the Netherlands
(15) Medical Statistics Section, Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands
Funding
This work was done within the framework of the BiobankBased Integrative Omics Studies (BIOS) Consortium funded by BBMRINL, a research infrastructure financed by the Dutch government (NWO 184.021.007).
Availability of data and material
RNAseq fastq files, DNA methylation idat files, phenotypes, and measured cell counts are available from the European Genomephenome Archive (EGA) under accession number [EGA:EGAC00001000277]. Our Bayesian method is available as an R/Bioconductor package BACON: http://bioconductor.org/packages/bacon/under the open source license GPL (≥2). Source code of all analyses and simulations are deposited at https://git.lumc.nl/molepi/biasandinflationwith DOI 10.5281/zenodo.162160.
Authors’ contributions
MvI and BTH designed the study. MvI and EvZ developed the statistical method. MvI performed all statistical analyses and implemented the software. MvI and BTH wrote the paper. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Ethics approval and consent to participate
The study was approved by the institutional review boards of the participating centers (CODAM, Medical Ethical Committee of the Maastricht University; LL, Ethics committee of the University Medical Centre Groningen; LLS, Ethical committee of the Leiden University Medical Center; and RS, Institutional review board (Medical Ethics Committee) of the Erasmus Medical Center). All participants have given written informed consent, and the experimental methods comply with the Helsinki Declaration.
Author information
Affiliations
Consortia
Corresponding author
Additional files
Additional file 1
Additional figures and tables. Additional Figures S1, S2, S3, S4, and S5. Additional Tables S1. and S2. (534 KB PDF)
Additional file 2
Additional text. In the Additional text we prove that genomic control is equivalent to the use of an empirical null distribution. Furthermore, a sketch of a proof is given to show that the omission of a variable introduces bias, and we briefly describe how we impute white blood cell counts. The last section describes all participants and institutes that were involved in the generation of the BIOS data collection. (202 KB PDF)
Additional file 3
Additional simulations. R code and documentation of all simulations described in the manuscript and some additional simulations. (923 KB HTML)
Additional file 4
Additional Excel Table S3a. Additional Excel Table S3a with output of significant results of the EWAS metaanalyses of age: effect sizes, standard error, P values, test statistics of all cohorts, and the metaanalysis results. (1750 KB XLSX)
Additional file 5
Additional Excel Table S3b. Additional Excel Table S3b with output of significant results of the EWAS metaanalyses of smoking: effect sizes, standard error, P values, test statistics of all cohorts, and the metaanalysis results. (133 KB XLSX)
Additional file 6
Additional Excel Table S3c. Additional Excel Table S3c with output of significant results of the TWAS metaanalyses of age: effect sizes, standard error, P values, test statistics of all cohorts, and the metaanalysis results. (121 KB XLSX)
Additional file 7
Additional Excel Table S3d. Additional Excel Table S3d with output of significant results of the TWAS metaanalyses of smoking: effect sizes, standard error, P values, test statistics of all cohorts, and the metaanalysis results. (40 KB PDF)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
van Iterson, M., van Zwet, E.W., the BIOS Consortium. et al. Controlling bias and inflation in epigenome and transcriptomewide association studies using the empirical null distribution. Genome Biol 18, 19 (2017). https://doi.org/10.1186/s1305901611319
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1305901611319
Keywords
 Epigenome and transcriptomewide association studies
 Bias
 Inflation
 Empirical null distribution
 Gibbs sampler
 Metaanalysis