Effectively identifying regulatory hotspots while capturing expression heterogeneity in gene expression studies
 Jong Wha J Joo^{1},
 Jae Hoon Sul^{2},
 Buhm Han^{3, 4},
 Chun Ye^{5} and
 Eleazar Eskin^{1, 2, 6}Email author
DOI: 10.1186/gb2014154r61
© Joo et al.; licensee BioMed Central Ltd. 2014
Received: 9 August 2013
Accepted: 7 April 2014
Published: 7 April 2014
Abstract
Expression quantitative trait loci (eQTL) mapping is a tool that can systematically identify genetic variation affecting gene expression. eQTL mapping studies have shown that certain genomic locations, referred to as regulatory hotspots, may affect the expression levels of many genes. Recently, studies have shown that various confounding factors may induce spurious regulatory hotspots. Here, we introduce a novel statistical method that effectively eliminates spurious hotspots while retaining genuine hotspots. Applied to simulated and real datasets, we validate that our method achieves greater sensitivity while retaining low false discovery rates compared to previous methods.
Background
Expression quantitative trait loci (eQTL) mapping is an approach linking genetic variation to gene expression to identify genomic loci containing gene expression modulators. An interesting observation in previous eQTL studies is that expression levels of thousands of genes may be affected by genetic variation at a single location called a regulatory hotspot. Although some of the identified regulatory hotspots correspond to genetic variants that truly govern expression of many genes, it has been recently reported that spurious hotspots that do not have genetic effects on genes may appear due to confounding factors such as batch effects. In this paper, we introduce a novel statistical method that effectively eliminates spurious regulatory hotspots while retaining genuine hotspots resulting from true genetic effects.
Understanding the relationship between genetic variation and gene regulation has recently received significant interest. The most common approach for studying this relationship is through eQTL, where both genetic variation and expression levels are collected from a set of individuals and associations between genetic variation and expression are estimated [1–9]. Any identified association, or eQTL, suggests the presence of a region harboring genetic variation that affects expression levels.
In eQTL mapping, two types of eQTLs are analyzed: ciseQTLs that are in close proximity to the gene locus and transeQTLs that occur at greater distances from the gene locus [10]. Previous eQTL studies for multiple organisms [2–4, 6] have shown that many genes are transregulated by a small number of genomic regions, known as ‘regulatory hotspots’. Although several eQTL studies have successfully identified regulatory hotspots [11–13], it has been reported in studies of recombinant inbred mice that regulatory hotspots replicate poorly [14]. Previous studies have discovered that these regulatory hotspots are spurious associations caused by various confounding factors, such as batch effects or other technical artifacts, which induce noise during sample preparation or expression measurements [15–17]. Confounding factors create heterogeneity in expression data and may induce spurious associations between SNPs and gene expressions, leading to the identification of ‘spurious regulatory hotspots’ [18]. In these spurious hotspots, SNPs appear to be associated with gene expression levels, although they do not have genetic effects on the genes.
Several computational methods have been developed to correct for confounding effects using various statistical methods such as singular value decomposition or linear mixed models [18–21]. The main assumption behind most of these methods is that the confounding factors influence the global correlation structure between the gene expression values. Hence, the methods, such as Intersample Correlation Emended (ICE) [18] and SVA [19], attempt to estimate the global correlation structure and use it as a covariate in the association to remove confounding effects from the association statistic. Although these methods effectively remove spurious regulatory hotspots, they may also remove true hotspots caused by genetic factors. This is because the global correlation structure contains genetic effects, so by correcting for the global structure, genetic effects are removed as well. For example, in a wellstudied yeast dataset, several hotspots are known to be true genetic effects since they have been validated by additional data such as protein measurements [22, 23]. Unfortunately, these hotspots are removed in addition to the spurious ones. Other methods [20, 21] also do not explicitly remove true genetic signals and either eliminated true hotspots or failed to remove spurious hotspots in our experiments.
In this paper, we introduce a new method called Nextgeneration Intersample Correlation Emended (NICE) eQTL mapping, which attempts to eliminate spurious regulatory hotspots while retaining hotspots caused by genetic effects by utilizing a novel statistical framework. Our method leverages an insight that confounding factors affect the majority of genes, while genetic effects only affect a subset. This insight allows us to distinguish between confounding and genetic effects. We used a recently developed statistic [24] to differentiate between genes that are affected by both genetic effects and confounding effects versus genes that are affected by only confounding effects. Using genes only affected by confounding, we are able to correct for the confounding effects but preserve the genetic effects. We first show by simulations that NICE successfully eliminates spurious regulatory hotspots while preserving regulatory hotspots corresponding to real genetic effects. On the other hand, previous methods either fail to eliminate confounding effects or fail to retain the genetic effects.
We demonstrated the utility of NICE with a yeast dataset. Versions of a yeast dataset were generated in 2005 [2] and 2008 [25]. Since they were generated 3 years apart in different locations, the hotspots that are shared between the datasets are likely to be the real genetic effects, while hotspots that are different between the datasets are likely to be spurious hotspots. We used our method on only the first dataset to see if we could discriminate between which hotspots are real and spurious as determined by the second dataset. Applied to the yeast dataset, NICE identified 83% (sensitivity) of the putative regulatory hotspots that are consistent between the two versions of the yeast dataset. Previous methods applied to this dataset either eliminated many of the putative hotspots or predicted many spurious hotspots. In addition, NICE eQTL mapping identified either more or a comparable number of cis associations relative to previous methods. Furthermore, applied to a yeast dataset grown in different conditions, NICE identified genes that are related to gene–environment interactions and discovered novel yeast regulatory hotspots that are likely to have a true biological mechanism.
Results
NICE eQTL mapping
To eliminate spurious associations, ICE [18] models the confounding effects by estimating the global correlation structure of the expression levels of all genes and uses this structure as a covariate in the association statistic. This has the same effect as if the confounding factor itself is included as a covariate in the association statistic, removing its effect. Unfortunately, any regulatory hotspots, as in the case of the first three genes of Figure 1(b), will also be captured in the global correlation structure and be eliminated. For this reason, ICE [18] tends to eliminate true regulatory hotspots in addition to confounding effects.
In contract to ICE [18], NICE uses only a subset of genes to model the global correlation structure between expression levels used to correct for confounding factors. Since most genes are affected by confounding effects, any subset of genes will likely capture the confounding effects and utilizing only those genes to estimate the global correlation structure is enough to correct for the confounding factor. Theoretically, if we can correct for confounding effects using the genes that are not involved in true regulatory hotspots, we would eliminate spurious associations while preserving true genetic associations. Unfortunately, we do not know in advance which genes are involved in the regulatory hotspots, which complicates the choice of which genes to use to correct for the confounding. Practically, almost all genes are affected by confounding effects while only some genes have both confounding and genetic effects. We expect this second group of genes to show stronger associations than the others. Thus we use the weakly associated genes to model the confounding factors. For example, in Figure 1(c), by using the four most weakly associated genes, we can correct for the confounding effects but preserve the genetic effects.
NICE eliminates spurious regulatory hotspots while preserving true genetic effects
To validate that our method eliminates spurious regulatory hotspots while preserving regulatory hotspots corresponding to real genetic effects, we generated a simulated dataset with both true regulatory hotspots and a batch effect that creates spurious hotspots. We created a dataset that has 100 samples with 1,000 SNPs and 1,000 gene expression levels. We added five transregulatory hotspots and cis effects. For each of the transregulatory hotspots, 20% of the genes have trans effects. SNPs were randomly generated with minor allele frequencies of 30%. A batch effect was simulated where expression levels in the first half of the samples were correlated with each other, but not correlated with the second half of the samples, and vice versa.
Number of real, missing and spurious hotspots identified by different methods applied to simulated data
Method  Real hotspots  Missing hotspots  Spurious hotspots  Sensitivity  False discovery rate 

ttest  5  0  8  1.0  0.62 
SVA  5  0  0  1.0  0 
ICE  0  5  0  0  N/A 
LMMEH  0  5  0  0  N/A 
PANAMA  5  0  8  1.0  0.62 
NICE  5  0  0  1.0  0 
NICE eliminates spurious hotspots while preserving genetic hotspots for a yeast dataset
We took advantage of a unique dataset consisting of two versions of a yeast dataset generated in 2005 [2] and 2008 [25] to validate our method. The two datasets contain similar strains, but were generated 3 years apart in different locations. For this reason, the hotspots that are shared between the datasets are likely real genetic effects, while hotspots that are different between the datasets may be spurious hotspots caused by technical confounding factors present at the time of generation of the datasets. In addition, some of these hotspots were further validated by other experimental data such as protein levels [22, 23].
The number of putative, missing and spurious hotspots for the 2005 yeast data [2]
Method  Putative hotspots  Missing hotspots  Spurious hotspots  Sensitivity  False discovery rate 

ttest  9  3  0  0.75  0 
SVA  5  7  0  0.42  0 
ICE  8  4  3  0.67  0.27 
LMMEH  2  10  5  0.17  0.71 
PANAMA  5  7  0  0.42  0 
NICE  10  2  1  0.83  0.09 
NICE discovered novel yeast regulatory hotspots
We reanalyzed the 2008 yeast dataset described above using NICE to demonstrate the utility of our approach. The dataset contains expression levels for yeast strains grown in both glucose and ethanol media. In our experiments above, we compared the consistency between the 2005 data and the 2008 data both for yeast grown in glucose. Here we analyzed both conditions in the 2008 data to identify both hotspots in each condition as well as hotspots involved in gene–environment interactions consistent with the previous analyses of this data [25]. To be consistent with the previous analyses, we utilized the method for determining the presence of a hotspot defined in Smith and Kruglyak instead of the metric we used above [25]. We began by dividing the yeast genome into 611 20kb bins. For each bin, we counted the number of significant trans linkages in the bin. Assuming a Poisson process, the number of expected linkages in each bin is the ratio between the number of trans linkages and the number of bins. For simplicity, we used the top 3,000 trans associations identified by each method yielding λ=4.9. After adjusting for the number of bins using a Bonferroni correction, a bin was considered to have statistically significance (P<0.05) if it has >13 linkages. When we identified significant linkages using a P value cutoff (P<5×10^{−5}), we achieved almost the same result.
To compare the regulatory hotspots found by various methods, we first defined 11 putative regulatory hotspots from a collection of independent experiments using the same parental strains grown in glucose [1, 27] (Additional file 3: Table S2). Some of these hotspots were expected because of deletions in one of the strains including a hotspot at Chr3:90000 for LEU2 and a hotspot at chr5:110000 for URA3.
The number of putative, missing and additional hotspots for the 2008 yeast dataset grown in glucose media [25]
Method  Putative  Missing  Additional  Glucose shared  Ethanol shared 

ttest  9  2  5  2  3 
ICE  8  3  14  5  13 
NICE  9  2  17  8  15 
One of the additional hotspots NICE found to be shared between ethanol and glucose is at Chr7:380000. However, from the ttest results, this hotspot appears to be ethanol specific. We further confirmed that this hotspot was also found by NICE for the 2005 data, suggesting that it is likely to be a real hotspot that is not condition specific. The two possible candidate genes are RPB9 and MNP1 since the regulatory hotspot is linked in cis to the expression of both of these genes at P values of 6.8×10^{−6} and 2.2×10^{−4}, respectively. RPB9 is a RNA polymerase II subunit that is crucial for transcription fidelity while MNP1 is a putative mitochondrial ribosomal protein that is required for respiratory growth. NICE also found an additional hotspot at Chr14:1360000 for the glucose data from both 2005 and 2008, but which is absent from the ethanol data suggesting that it is a glucosespecific hotspot. The ttest did not find this hotspot for either glucose dataset. The closest gene is APT2, which is an apparent pseudogene that is not expressed in normal conditions. Interestingly, with our data, we found a strong association between Chr14:1360000 and APT2 in cis at P=2.3×10^{−17}, suggesting that this gene might be functional in a glucosedependent way.
Discussion
In this paper, we present a novel approach, NICE, for identifying true genetic regulatory hotspots while eliminating spurious hotspots caused by confounding factors. We leveraged the insight that confounding factors are likely to affect the majority of genes, while genetic effects are likely to affect only a smaller subset. This insight allowed our approach to distinguish between true and spurious regulatory hotspots. Our approach is related to previous methods that correct for confounding factors, such as ICE or SVA, which model the global correlation structure and use this structure to correct the association statistics to eliminate the effect of any confounding factors affecting the association statistic. NICE uses only a subset of genes predicted not to be part of the true genetic hotspot to model the global correlation structure between expression levels to correct for confounding factors, which eliminates the confounding factors but preserves true hotspots. We compared several previous approaches [18–21] with both simulated and real datasets, and demonstrated that our method achieved higher or comparable statistical power when identifying associations while correcting for confounding factors.
While our approach, NICE, extends the mixed model approach that ICE [18] used, in principle, the basic idea behind our approach can be applied to other approaches for correcting for expression heterogeneity such as the SVA approach based on singular value decomposition [19]. In this method, for each SNP, a separate singular value decomposition would be computed only taking into account genes that are predicted not to be part of a genetic hotspot. Similarly, the techniques used in LMMEH [20] can also be adapted in this framework to incorporate multiple variance components to correct for population structure and also correct for bias in estimating the global correlation structure in the presence of population structure.
Our method is based on the assumption that confounding factors are likely to affect the majority of genes, while genetic effects are likely to affect only a subset of the genes. While our approach is an improvement over current methods, in some cases this assumption may be violated, for example, slightly different growth temperatures between batches may result in a specific subset of genes being differentially expressed (e.g. heat shock, cell cycle regulators, etc.). In these cases, our approach would be unable to distinguish those confounding effects from real genetic effects. An additional challenge in eQTL studies is correcting for multiple testing. Possible approaches for multiple testing correction are applying permutation tests or FDRs. Unfortunately, when confounding is present, the confounding causes a violation of the basic assumption necessary for these approaches, which is that the individuals in the sample are independent and identically distributed. Shared confounding factors induces complex dependencies among the gene expression patterns of individuals and complicates multiple testing. How to correct for multiple testing in the presence of confounding is a fundamental problem and a promising avenue of future work, which is beyond the scope of this paper.
Conclusions
In this paper, we introduce a novel statistical method that effectively eliminates spurious regulatory hotspots resulting from various confounding factors while retaining genuine hotspots resulting from true genetic effects. In simulations, our method perfectly segregates genuine and spurious hotspots. We validate our method on yeast data where the locations of true genetic hotspots are known through concordance between replicated datasets. Our method achieves greater sensitivity (83%) in detecting concordant hotspots compared to previous methods while retaining a low false discovery rate. In addition to detecting regulatory hotspots, our method identifies more or a comparable number of ciseQTLs than other methods. We further apply our method to yeast data grown in different conditions to identify genebyenvironment interactions and show that our method discovers novel yeast regulatory hotspots that are likely to have a true biological mechanism.
Materials and methods
Generative model
Let n be the number of individuals, m the number of genes and l the number of SNPs. Y is an n×m matrix of the gene expression values, μ is an n×l matrix of the means of expression levels of individuals, X is an n×l matrix with SNPs encoded by 0 and 1 for haploid and 0, 1 and 2 for diploid, β is an l×m matrix for their coefficients, and u and e are n×m matrices with multivariate normal random variables sampled from $N(0,{\sigma}_{g}^{2}H)$ and $N(0,{\sigma}_{e}^{2}I)$ accounting for the confounding effects and random errors. Here, H is an n×n covariance matrix that explains the intersample correlation structure induced by confounders and I is an n×n identity matrix. ${\sigma}_{g}^{2}$ and ${\sigma}_{e}^{2}$ are coefficients of the two variance components.
eQTL mapping
where y_{ i } is a size n vector denoting gene expression levels of individuals, μ_{ i } is a size n vector denoting the mean of expression levels of individuals, x_{ j } is a size n binary vector denoting SNPs of individuals, ${u}_{i}\sim N(0,{\sigma}_{g}^{2}H)$ are confounding effects, and ${\epsilon}_{\mathit{\text{ij}}}\sim N(0,{\sigma}_{e}^{2}I)$ are residual errors. The null hypothesis that we want to test is β_{ i j }=0. Typically, H is defined or estimated before the eQTL mapping. Given an estimated $\hat{H}$, we use the efficient mixedmodel association (EMMA) C package [28] to estimate efficiently the variance components (${\sigma}_{g}^{2}$ and ${\sigma}_{e}^{2}$). We use the F test as previously suggested on the basis of REML (Restricted Maximum Likelihood) estimates of variance components [28–30]. The challenge in this model is how to estimate $\hat{H}$ that is close to the true covariance structure of confounding, H.
ICE eQTL mapping
The ICE eQTL mapping approach [18] utilizes a global intersample correlation generated from all genes to estimate H. The global intersample correlation matrix for an expression dataset is generated as follows. Let Y be an m×n expression matrix with n individuals for m genes. Let μ_{ i },σ_{ i } be the mean and standard deviation of expression values of the ith genes (Y_{i1},Y_{i2},...,Y_{ i n }). Let Z be an m×n matrix with each element Z_{ i j }=(Y_{ i j }−μ_{ i })/σ_{ i }. The intersample correlation matrix is defined as the covariance matrix of Z, $\hat{H}=Cov\left(Z\right)$. The estimated intersample correlation matrix $\hat{H}$ is then used in the linear mixed model in equation (2) to correct for the confounding effects.
NICE eQTL mapping
We propose a new eQTL mapping approach called NICE eQTL mapping. NICE builds upon the framework of ICE eQTL mapping but uses a more refined strategy to estimate H, the covariance matrix of confounding effects. The primary limitation of ICE is that it uses the global intersample correlation generated from all genes. If there exists a regulatory hotspot that affects many genes, ICE will overcorrect for the confounding and remove the associations to the regulatory hotspot. To overcome this challenge, we must use the genes that are only affected by the confounding but not by the regulatory hotspots to estimate H. It turns out that segregating these two groups of genes is a highly challenging computational problem.
Assumptions
We assume that confounding affects the global correlation structure of the gene expressions thereby affecting most of the genes. This is a standard assumption consistent with previous approaches. We then assume that true regulatory hotspots affect only a subset of the genes. This assumption will be invalid only if a hotspot affects most of the genes, which will be unlikely in practice. Our goal is to separate the genes affected by the true genetic effects from the genes affected only by the confounding. If we can successfully separate them, we will be able to estimate H more accurately using the genes affected only by the confounding.
To this end, we make an assumption that the effect size of genetic effects is greater than the magnitude by which the confounding affects the expression levels. That is, we assume that the genes with true genetic effects tend to have more significant results than the other genes affected only by the confounding. This assumption may not be true if the genetic effects are small and the confounding is severe, but in such cases, the noisy data will be highly challenging and in this paper we will ignore such cases. Additionally, we assume that the true genetic effects of regulatory hotspot may have a structure. For example, the hotspot may be related to an enhancer element upregulating many genes, in which case the mean of the effect will be nonzero. Ideally, we would want to use such structures to discriminate the true genetic effects from confounding.
Bayesian framework
Note that a few simplifications are employed in this model. First, based on our assumption that the confounding effects are sufficiently smaller than the genetic effects, we approximated the confounding effects as zero. Second, to capture the possible structure within genetic effects, we employed the mean term μ. Although this is a simplified model, we found that this approach can capture the majority of the genes affected by the genetic effects, which turns out to be sufficient for our purpose of finding an accurate H.
consisting of the probability of $\overrightarrow{\beta}$ given T and the prior probability of T.
Connection to metaanalytic approach
It turns out that our Bayesian model for eQTL mapping is equivalent to a metaanalysis model although their contexts are different. In a metaanalysis of genetic association studies that combines multiple independent studies, if there exists heterogeneity, which refers to the differences in effect sizes of studies [31], it is challenging to predict which study has an effect and which study does not. Thus, the problem of finding studies with an effect is essentially equivalent to the problem of finding genes having genetic effects in our context. Recently, we have developed an efficient method to solve this problem in the context of metaanalysis [24]. Here we adapt this approach to calculate the posterior probability of T. We briefly describe below how we can calculate g(t).
where t is the number of 1’s in t and B is the beta function.
See Han and Eskin [24] for the details of the derivation. As a result, we can calculate g(t) for every t.
Markov chain Monte Carlo
 1.
Start from a random t.
 2.
Choose a next t, t ^{′}, based on the moves defined below.
 3.
If g(t)<g(t ^{′}), move to t ^{′}. Otherwise, move to t ^{′} with probability g(t ^{′})/g(t).
 4.
Repeat from step 2.
The set of moves we use for choosing t^{′} is {M_{1},M_{2},…,M_{ m }}∪{M_{shuffle}}. M_{ i } is a simple flipping move of T_{ i } between 0 and 1. M_{shuffle} is a move that shuffles the values of T. At each step, we randomly choose a move from this set assuming a uniform distribution. Other moves can also be used, such as moves based on Bayes’ factors. We allow n_{ B } burnin and sample n_{ S } times. After sampling, n_{ S } samples gives us an approximation of the distribution over g(t), which subsequently gives the approximations of equation (3). Calculating the posterior probability is the most computationally intensive part of NICE relative to ICE [18] with respect to the running time. By using MCMC, we make dramatic reductions in computational cost, which allows NICE to scale to large datasets.
NICE intersample correlation matrix
After we calculate the posterior probability that a gene is affected by the true genetic effect of the SNP, we select genes with a probability less than a threshold η=0.5. This set of genes represents the genes that are putatively affected only by the confounding. Thus, we use this set of genes to build the intersample correlation matrix ${\hat{H}}_{\text{NICE}}$. Then we apply ${\hat{H}}_{\text{NICE}}$ to the linear mixed model (2) to correct for the confounding in our eQTL mapping. The reason why we choose η=0.5 as a threshold is because we want to find a subset of genes approximately without genetic effects. Although it is ideal to select all the genes without genetic effects, any subset of those genes is likely to capture the global correlation structure as shown in Figure 1(c), and is enough to correct for confounding effects. We choose genes that have an effect with less than 50% chance to select genes that are putatively affected only by the confounding effect. However, we find that unless the threshold is extreme (e.g. η≤0.1 or η≥0.9), all thresholds yield similar results and the result is robust to the parameter (Additional file 4: Figure S2). We count the number of genes selected by NICE using the posterior probability with threshold η=0.5 applied to the yeast data generated in 2005 [2] (blue dots in Additional file 5: Figure S3). Except for the putative hotspots, mostly, NICE uses the majority of the genes to build the intersample correlation matrix ${\hat{H}}_{\text{NICE}}$ similar to ICE [18].
Implementation
To calculate the posterior probability in equation (3), we used METASOFT [31] with prior parameters, σ=0.05, α_{1}=1 and α_{2}=5. We used σ=0.05 by assuming a small effect size. However, an effect size of up to 0.4, which is a possible choice as a large effect size in a genomewide association study [32, 33], did not affect the results significantly (Additional file 6: Figure S4). We assume that confounding affects most of the genes while true regulatory hotspots affect only a subset of the genes. Based on the assumption, we assume that 20% of the genes have trans effects for our prior (α_{1}/α_{2}=0.2). We used α_{1}=1 and α_{2}=5 to give a diffuse distribution. In practice, changing the α_{1} and α_{2} priors gives similar results as changing the threshold η (data not shown). As shown in Additional file 4: Figure S2, the results are robust unless the threshold/priors are too extreme. If one does not have prior information about a dataset, we suggest using these default priors as they are based on our model assumption. We used n_{ B }=1,000 burnin and n_{ S }=1,000,000 sampling in MCMC. We selected genes with posterior probability less than η=0.5. If less than 1% of the genes were selected to calculate the covariance matrix, we used the standard ttest instead of NICE.
Pvalue based approach
Instead of using the posterior probability described in the previous sections, here we show whether a more standard test statistic, such as the P value from the standard ttest, could be used for selecting genes without genetic effects to estimate the intersample correlation matrix H. We use the P value for selecting genes without genetic effects in the following approach. For each SNP, we first order the genes based on the P value obtained using the standard ttest, { g_{1},g_{2},…,g_{ m }}, where g_{1} is the gene with the largest P value and g_{ m } is the gene with the smallest P value when there are m genes. Then we select the first x% of the ordered genes { g_{1},g_{2},…,g_{x m/100}} as the genes without genetic effects and use the expression levels of those genes to estimate H. The following processes are the same as those used for NICE. Let us say α is the percentage of genes that have trans effects on a transregulatory hotspot. When we apply this approach for various simulated datasets with different α, x=100−α gives the best estimation of H for correcting for the confounding effects but retaining the true genetic effects (here we show only the case when α=20). However, when we use fewer (x<100−α) or more (x>100−α) genes, we fail to remove confounding effects or fail to retain the true genetic effects, respectively. Additional file 7: Figure S5 shows eQTL maps for when this approach was applied to our simulated data. Our simulated data has trans effects on 20% of the genes (α=20), in other words, 80% of genes do not have trans effects. Therefore, when we use 80% (x=80) of the genes to build the intersample correlation matrix H, we are able to correct for the confounding effects but retain the genetic effects. However, when we use fewer of the genes, e.g. 60% (x=60), or more of them, e.g. 99% (x=99), we either fail to remove confounding effects or fail to retain the true genetic effects, respectively. Unfortunately, we do not know how many genes have trans effects for each marker in advance. Moreover, we note that this approach creates many spurious associations other than the ones induced by confounding effects. For example, many horizontal lines appear in the eQTL map (Additional file 7: Figure S5(a)). This is because when we select x% of genes with the largest P values, some of the selected genes are shared between many SNPs and this creates spurious associations between the shared genes and the SNPs.
We also applied this approach to the yeast dataset generated in 2005 [2] using 10%, 30%, 50%, 70% and 90% of the genes. As a result, it missed many putative hotspots and made many false positive predictions (Additional file 8: Figure S6). Thus, we conclude that the P value is ineffective for selecting genes without genetic effects. On the other hand, the posterior probability that we use for NICE is robust as the value of η neither has a significant influence on the results nor is specific to the datasets.
Simulated dataset
We generated a simulated dataset for 1,000 genes, 1,000 SNPs, and 100 samples based on our generative model, equation (1), with σ_{ g }=0.9 and σ_{ e }=0.1. Assuming haploidy, the SNPs were encoded by 0 and 1 and randomly generated with a minor allele frequency of 30%. A batch effect was simulated as a confounding effect where expression levels in the first half of the samples were correlated with each other, but not correlated with the second half of the samples, and vice versa. Five randomly selected transregulatory hotspots were simulated and for each of them, 20% of the genes had trans effects of size 0.4 where half had positive effects and the other half had negative effects. The cis effect was simulated with a size of 0.5.
Yeast datasets
We evaluated our method using replicate gene expression datasets. We used two versions of a yeast dataset produced 3 years apart at different locations using different microarray platforms. The first dataset [2] was generated in 2005, of which 6,138 probes and 2,956 genotyped loci in 112 segregants were used. The second dataset [25] was generated in 2008, of which 6,138 probes and 2,956 genotyped loci in 109 segregants were used. We classified the eQTL as cisacting when the location of an SNP and the location of a probe were within 50 kb. We calculated the number of ciseQTLs for different FDRs where the FDRs were calculated using the q value function of R.
Running previous methods
For running previous methods, SVA [19], ICE [18], LMMEH [20] and PANAMA [21], we downloaded the program available from the authors and ran it using default options. For running SVA, the ‘twostep’ method was used. For running LMMEH, eLMM v1.2 was used for generating the covariance matrix K_{ E H } and FaSTLMM v2.05 [34] was used for calculating the associations. For running eLMM, the ICE covariance matrix was used for the initial K_{ E H }. The EM (Expectation Maximization) steps for each full iteration was set to 3 and 10 ∼ 20 number of total iterations was applied. eLMM includes LMMEHPS, which corrects for confounding factors as well as population structure. We used LMMEH instead of LMMEHPS because neither our simulated nor yeast dataset contained a population structure. In addition, LMMEHPS failed to run on our Windows machine with 1.73 GHz Intel Core i7 CPU and 4 GB RAM. Nicolo Fusi, the author of PANAMA, helped with running this program.
Abbreviations
 eQTL:

expression quantitative trait loci
 FDR:

false discovery rate
 ICE:

Intersample Correlation Emended
 kb:

kilobase
 MCMC:

Markov chain Monte Carlo
 NICE:

Nextgeneration Intersample Correlation Emended
 SNP:

single nucleotide polymorphism.
Declarations
Acknowledgements
JWJJ, JHS, BH and EE are supported by National Science Foundation grants 0513612, 0731455, 0729049, 0916676, 1065276, 1302448 and 1320589, and National Institutes of Health grants K25HL080079, U01DA024417, P01HL30568, P01HL28481, R01GM083198, R01MH101782 and R01ES022282. We acknowledge the support of the National Institute of Neurological Disorders and Stroke’s Informatics Center for Neurogenetics and Neurogenomics (P30 NS062691).
Authors’ Affiliations
References
 Brem RB, Yvert G, Clinton R, Kruglyak L: Genetic dissection of transcriptional regulation in budding yeast. Science. 2002, 296: 752755. 10.1126/science.1069516. doi: 10.1126/science.1069516View ArticlePubMedGoogle Scholar
 Brem RB, Kruglyak L: The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc Natl Acad Sci USA. 2005, 102: 15721577. 10.1073/pnas.0408709102. doi: 10.1073/pnas.0408709102View ArticlePubMedPubMed CentralGoogle Scholar
 Keurentjes JJB, Fu J, Terpstra IR, Garcia JM, van den Ackerveken G, Snoek LB, Peeters AJM, Vreugdenhil D, Koornneef M, Jansen RC: Regulatory network construction in Arabidopsis by using genomewide gene expression quantitative trait loci. Proc Natl Acad Sci USA. 2007, 104: 17081713. 10.1073/pnas.0610429104. [doi: 10.1073/pnas.0610429104]View ArticlePubMedPubMed CentralGoogle Scholar
 Chesler EJ, Lu L, Shou S, Qu Y, Gu J, Wang J, Hsu HC, Mountz JD, Baldwin NE, Langston MA, Threadgill DW, Manly KF, Williams RW: Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat Genet. 2005, 37: 233242. 10.1038/ng1518. doi: 10.1038/ng1518View ArticlePubMedGoogle Scholar
 Bystrykh L, Weersing E, Dontje B, Sutton S, Pletcher MT, Wiltshire T, Su AI, Vellenga E, Wang J, Manly KF, Lu L, Chesler EJ, Alberts R, Jansen RC, Williams RW, Cooke MP, de Haan G: Uncovering regulatory pathways that affect hematopoietic stem cell function using ‘genetical genomics’. Nat Genet. 2005, 37: 225232. 10.1038/ng1497. doi: 10.1038/ng1497View ArticlePubMedGoogle Scholar
 Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT: Mapping determinants of human gene expression by regional and genomewide association. Nature. 2005, 437: 13651369. 10.1038/nature04244. doi: 10.1038/nature04244View ArticlePubMedPubMed CentralGoogle Scholar
 Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, Ingle CE, Dunning M, Flicek P, Koller D, Montgomery S, Deloukas P, Tavaré S: Population genomics of human gene expression. Nat Genet. 2007, 39: 12171224. 10.1038/ng2142. doi: 10.1038/ng2142View ArticlePubMedPubMed CentralGoogle Scholar
 Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, Zhu J, Carlson S, Helgason A, Walters GB, Gunnarsdottir S, Mouy M, Steinthorsdottir V, Eiriksdottir GH, Bjornsdottir G, Reynisdottir I, Gudbjartsson D, Helgadottir A, Jonasdottir A, Jonasdottir A, Styrkarsdottir U, Gretarsdottir S, Magnusson KP, Stefansson H, Fossdal R, Kristjansson K, Gislason HG, Stefansson T, Leifsson BG, Thorsteinsdottir U, Lamb JR, et al: Genetics of gene expression and its effect on disease. Nature. 2008, 452: 423428. 10.1038/nature06758. doi: 10.1038/nature06758View ArticlePubMedGoogle Scholar
 Spielman RS, Bastone LA, Burdick JT, Morley M, Ewens WJ, Cheung VG: Common genetic variants account for differences in gene expression among ethnic groups. Nat Genet. 2007, 39: 226231. 10.1038/ng1955. doi: 10.1038/ng1955View ArticlePubMedPubMed CentralGoogle Scholar
 Michaelson JJ, Loguercio S, Beyer A: Detection and interpretation of expression quantitative trait loci (eQTL). Methods. 2009, 48: 265276. 10.1016/j.ymeth.2009.03.004. doi: 10.1016/j.ymeth.2009.03.004View ArticlePubMedGoogle Scholar
 Cervino AC, Li G, Edwards S, Zhu J, Laurie C, Tokiwa G, Lum PY, Wang S, Castellani LW, Castellini LW, Lusis AJ, Carlson S, Sachs AB, Schadt EE: Integrating QTL and highdensity SNP analyses in mice to identify Insig2 as a susceptibility gene for plasma cholesterol levels. Genomics. 2005, 86: 505517. 10.1016/j.ygeno.2005.07.010. doi: 10.1016/j.ygeno.2005.07.010View ArticlePubMedGoogle Scholar
 Hillebrandt S, Wasmuth HE, Weiskirchen R, Hellerbrand C, Keppeler H, Werth A, SchirinSokhan R, Wilkens G, Geier A, Lorenzen J, Köhl J, Gressner AM, Matern S, Lammert F: Complement factor 5 is a quantitative trait gene that modifies liver fibrogenesis in mice and humans. Nat Genet. 2005, 37: 835843. 10.1038/ng1599. doi: 10.1038/ng1599View ArticlePubMedGoogle Scholar
 Wang X, Korstanje R, Higgins D, Paigen B: Haplotype analysis in multiple crosses to identify a QTL gene. Genome Res. 2004, 14: 17671772. 10.1101/gr.2668204. doi: 10.1101/gr.2668204View ArticlePubMedPubMed CentralGoogle Scholar
 Peirce JL, Li H, Wang J, Manly KF, Hitzemann RJ, Belknap JK, Rosen GD, Goodwin S, Sutter TR, Williams RW, Lu L: How replicable are mRNA expression QTL?. Mamm Genome. 2006, 17: 643656. 10.1007/s0033500501878. doi: 10.1007/s0033500501878View ArticlePubMedGoogle Scholar
 Churchill GA: Fundamentals of experimental design for cDNA microarrays. Nat Genet. 2002, 32 Suppl: 490495. doi: 10.1038/ng1031View ArticlePubMedGoogle Scholar
 Fare TL, Coffey EM, Dai H, He YD, Kessler DA, Kilian KA, Koch JE, LeProust E, Marton MJ, Meyer MR, Stoughton RB, Tokiwa GY, Wang Y: Effects of atmospheric ozone on microarray data quality. Anal Chem. 2003, 75: 46724675. 10.1021/ac034241b.View ArticlePubMedGoogle Scholar
 Branham WS, Melvin CD, Han T, Desai VG, Moland CL, Scully AT, Fuscoe JC: Elimination of laboratory ozone leads to a dramatic improvement in the reproducibility of microarray gene expression measurements. BMC Biotechnol. 2007, 7: 810.1186/1472675078. doi: 10.1186/1472675078View ArticlePubMedPubMed CentralGoogle Scholar
 Kang HM, Ye C, Eskin E: Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics. 2008, 180: 19091925. 10.1534/genetics.108.094201. doi: 10.1534/genetics.108.094201View ArticlePubMedPubMed CentralGoogle Scholar
 Leek JT, Storey JD: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007, 3: 17241735. doi: 10.1371/journal.pgen.0030161View ArticlePubMedGoogle Scholar
 Listgarten J, Kadie C, Schadt EE, Heckerman D: Correction for hidden confounders in the genetic analysis of gene expression. Proc Natl Acad Sci USA. 2010, 107: 1646516470. 10.1073/pnas.1002425107. doi: 10.1073/pnas.1002425107View ArticlePubMedPubMed CentralGoogle Scholar
 Fusi N, Stegle O, Lawrence ND: Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Comput Biol. 2012, 8: 100233010.1371/journal.pcbi.1002330. doi: 10.1371/journal.pcbi.1002330View ArticleGoogle Scholar
 Foss EJ, Radulovic D, Shaffer SA, Ruderfer DM, Bedalov A, Goodlett DR, Kruglyak L: Genetic basis of proteome variation in yeast. Nat Genet. 2007, 39: 13691375. 10.1038/ng.2007.22. doi: 10.1038/ng.2007.22View ArticlePubMedGoogle Scholar
 Perlstein EO, Ruderfer DM, Roberts DC, Schreiber SL, Kruglyak L: Genetic basis of individual differences in the response to smallmolecule drugs in yeast. Nat Genet. 2007, 39: 496502. 10.1038/ng1991. doi: 10.1038/ng1991View ArticlePubMedGoogle Scholar
 Han B, Eskin E: Interpreting metaanalyses of genomewide association studies. PLoS Genet. 2012, 8: 100255510.1371/journal.pgen.1002555. doi: 10.1371/journal.pgen.1002555View ArticleGoogle Scholar
 Smith EN, Kruglyak L: Gene–environment interaction in yeast gene expression. PLoS Biol. 2008, 6: 8310.1371/journal.pbio.0060083. doi: 10.1371/journal.pbio.0060083View ArticleGoogle Scholar
 Devlin B, Roeder K: Genomic control for association studies. Biometrics. 1999, 55: 9971004. 10.1111/j.0006341X.1999.00997.x.View ArticlePubMedGoogle Scholar
 Yvert G, Brem RB, Whittle J, Akey JM, Foss E, Smith EN, Mackelprang R, Kruglyak L: Transacting regulatory variation in Saccharomyces cerevisiae, and the role of transcription factors. Nat Genet. 2003, 35: 5764.View ArticlePubMedGoogle Scholar
 Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E: Efficient control of population structure in model organism association mapping. Genetics. 2008, 178: 17091723. 10.1534/genetics.107.080101. doi: 10.1534/genetics.107.080101View ArticlePubMedPubMed CentralGoogle Scholar
 Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S, Buckler ES: A unified mixedmodel method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006, 38: 203208. 10.1038/ng1702. doi: 10.1038/ng1702View ArticlePubMedGoogle Scholar
 Zhao K, Aranzana MJ, Kim S, Lister C, Shindo C, Tang C, Toomajian C, Zheng H, Dean C, Marjoram P, Nordborg M: An Arabidopsis example of association mapping in structured samples. PLoS Genet. 2007, 3: 410.1371/journal.pgen.0030004. doi: 10.1371/journal.pgen.0030004View ArticleGoogle Scholar
 Han B, Eskin E: Randomeffects model aimed at discovering associations in metaanalysis of genomewide association studies. Am J Hum Genet. 2011, 88: 586598. 10.1016/j.ajhg.2011.04.014. doi: 10.1016/j.ajhg.2011.04.014View ArticlePubMedPubMed CentralGoogle Scholar
 Stephens M, Balding DJ: Bayesian statistical methods for genetic association studies. Nat Rev Genet. 2009, 10: 681690. 10.1038/nrg2615. doi: 10.1038/nrg2615View ArticlePubMedGoogle Scholar
 Marchini J, Howie B, Myers S, McVean G, Donnelly P: A new multipoint method for genomewide association studies by imputation of genotypes. Nat Genet. 2007, 39: 906913. 10.1038/ng2088. doi: 10.1038/ng2088View ArticlePubMedGoogle Scholar
 Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D: Fast linear mixed models for genomewide association studies. Nat Methods. 2011, 8: 833835. 10.1038/nmeth.1681. doi: 10.1038/nmeth.1681View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.