Towards the uniform distribution of null P values on Affymetrix microarrays
© Fodor et al.; licensee BioMed Central Ltd. 2007
Received: 11 September 2006
Accepted: 1 May 2007
Published: 01 May 2007
Methods to control false-positive rates require that P values of genes that are not differentially expressed follow a uniform distribution. Commonly used microarray statistics can generate P values that do not meet this assumption. We show that poorly characterized variance, imperfect normalization, and cross-hybridization are among the many causes of this non-uniform distribution. We demonstrate a simple technique that produces P values that are close to uniform for nondifferentially expressed genes in control datasets.
Microarray data typically involve tens of thousands of genes but only a handful of replicates. It is therefore difficult to establish appropriate P value thresholds for significance. For example, consider the response of 40,000 genes to two different experimental conditions, say diseased and healthy tissue. If a significance level of P < 0.05 is chosen, then one would expect an unacceptable number (2,000 [40,000 × 0.05]) of false positives. A conceptually simple procedure, the Bonferroni correction, would set a threshold of P = 1.25 × 10-6 (0.05/40,000). Using this P value as the threshold for significance, there is only a 0.05 chance of any false positives across all of the 40,000 comparisons between the two conditions. Such metrics are said to control the 'family-wise error rate'. Family-wise error rate is often assumed to be too conservative for microarray experiments, because there are often no results with P values below the threshold for the modest number of samples that make up most microarray experiments. Recently, 'false discovery rate' (FDR) was proposed as an alternative, more permissive approach to estimating significance of microarray experiments [1–4]. This metric acknowledges that biologists are often able to tolerate some error in gene lists. For example, a FDR could be set at 10%, in which case a list of 100 genes would be expected to have as many as 10 false positives.
No matter what threshold is used to control significance in microarray experiments, there is an inherent assumption that the P values of genes that are not differentially expressed follow a uniform distribution. For example, genes that are not differentially expressed should have a P value of 0.01 or smaller only 1% of the time. The uniform distribution of null P values seems like a safe assumption that is guaranteed by the laws of statistics. However, if for some reason this assumption is not met, then attempts to determine a threshold of significance may yield meaningless results [2, 5].
In this report we show that commonly used statistics can in fact generate distributions of P values for non-differentially expressed genes that are far from uniform. We demonstrate a simple method for producing P values that are much closer to the expected uniform distribution.
Results and discussion
RMA summation and quantile-quantile normalization suppress the pooled variance of each gene
Our central argument is that it is a rational choice to assume that, when comparing two conditions, the pooled variance of each gene on the array is approximately constant. If this assumption is true, then the distribution of scores under a t test or variant approaches the normal distribution. We begin our assertion that this assumption is reasonable by examining a control dataset released by Affymetrix. The Affymetrix HG-U133A Latin Square dataset consists of 14 'experiments' (labeled 1 to 14), each with three replicates. Each of the 14 experiments contains 42 genes that are spiked in at known concentrations against a constant background of human RNA. Of the approximately 22,000 genes on the chip, the only ones that should be different when comparing across experiments are the 42 genes that were spiked in at different concentrations. We shall refer to genes that were not spiked in as null genes, because the null hypothesis of equal expression in all conditions is true for these genes.
We argue in our report that σ2 can be thought of as approximately constant. This is clearly not true at the probe level in Figure 1a. Microarray analysis, however, is usually not performed directly at the probe level. For many microarray experiments, the desired analysis is at the gene level. A well studied problem in the analysis of Affymetrix arrays is how best to summarize the multiple probes in a probeset to produce a single value for each gene on each chip [7–10]. All of the probeset data in this report were generated with the log2-transformed Robust Multichip Average (RMA) summary statistic , which is a well regarded and robust measurement that has been shown to work well in a variety of conditions . After transformation with the RMA statistic, our data can be represented as a single spreadsheet or matrix in which the columns represent experiments and the rows represent genes.
Figure 1b shows σ2 as a function of for the approximately 22,000 probesets generated by the comparison of Latin Square experiments 1 and 2 after RMA summation. We note immediately that RMA summation suppresses the standard error. The values for probeset σ2 in Figure 1b are on the order of 10 to 20 times smaller than the probe σ2 observed in Figure 1a. In addition, we can tell by immediate inspection that the estimates of σ2 in Figure 1b must contain errors because they are not symmetrical. The data in Figure 1 are from null (not spiked in) genes. The expected value of is therefore zero and there is no reason to believe that σ2 should deviate from symmetry around zero. Clearly, in Figure 1b, however, there is a strong tendency for σ2 to be larger when exceeds zero. This must be due to some systematic error in the underlying data. RMA summation is usually accompanied by quantile-quantile normalization , which is designed to correct for systematic errors in microarray data. Figure 1c shows the relationship between σ2 and after both quantile-quantile normalization and RMA summation. We see that after quantile-quantile normalization, the standard error approaches a constant across the range of scores. In the following section we show that the deviations from a constant value of σ2 that remain after normalization and RMA summation are likely to contain errors because, even on normalized data, test statistics work better if they assume that σ2 is constant.
The measured standard error either before or after quantile-quantile normalization is unreliable
The shrinkage factor, B, can vary between 0 and 1 in Equation 3. When B = 0, Equation 3 reduces to the standard t test of Equation 1. When B = 1, the statistic essentially ignores the variance, in that it reduces to assigning a score based only on the average difference between the genes divided by a constant.
To explore the effects of variance shrinkage and normalization on statistic performance across multiple Latin Square comparisons, we choose an arbitrary threshold; we consider how many true positives are captured by each statistic for a threshold cutoff that also captures four false positives (Figure 2a, dashed vertical line). The box plots in Figure 2 show this value for each statistic over the 14 Latin Square experiments in which the spiked-in ratios differ by a factor of two in the absence (Figure 2b) and presence (Figure 2c) of quantile-quantile normalization. We note that whether one uses Bayesian statistic to weigh the variance of each gene (as in the cyber t test) or shrinks the standard error according to Equation 3 (with B approaching 1), much better performance is achieved than with the standard t test, regardless of normalization schemes. This suggests that both before and after quantile-quantile normalization, the variance reported for each gene is unreliable.
Different analysis schemes yield very different distributions of Pvalues
We have argued that quantile-quantile normalization is effective in part because it replaces the unreliable estimates of σ2 with a distribution that approaches a constant (Figures 1 and 3) and that, furthermore, test statistics appear to work better when they assume that σ2 approaches a constant (Figure 2). We now turn to the issue of how we can utilize this assumption of constant standard error to produce more accurate estimates of P values.
Given the poor performance of the standard t test in ranking differentially expressed genes (Figure 2), it is perhaps not surprising that the P values generated by the standard t test fall so far from uniform. Does the cyber t test, which clearly outperforms the standard t test in ranking differentially expressed genes (Figure 2), produce P values closer to a uniform distribution? Rather than determining σ2 independently for each gene, the cyber t test uses Bayesian statistics to weigh the variance of each gene by the variance of genes on the array with similar intensities. Because the estimate for the variance of each gene is not independent, the authors of the cyber t test do not expect the cyber t test to follow a simple t distribution with n - 2 degrees of freedom. Indeed, the P values reported by the R implementation of the cyber t test that we used are generated with an assumption of 22 degrees of freedom, given three experiments in each condition and the default parameters (see Materials and methods, below). Figure 4a (black lines) presents the P values reported by the R implementation of the cyber t test. We see that, despite the correction for lack of independence by increasing the number of degrees of freedom, the P values reported by the cyber t test are also poorly described by a uniform distribution.
If the cyber t test does not appear to follow a t distribution, then can we find a more appropriate distribution that it does follow? In Figure 1c, we have seen that σ2 approaches a constant for null genes after RMA summation and quantile-quantile normalization. The cyber t test estimates the prior variance of each gene as a function of that gene's intensity. After RMA summation and quantile-quantile normalization, that prior variance should be close to constant. Because in the Latin Square dataset we have small sample sizes, the Bayesian cyber t estimate gives a good deal of weight to the prior variance, and therefore the cyber t estimate of variance for each gene will also approach a constant. As a distribution approaches divided by a constant, it will become normally distributed. We might anticipate, therefore, that the distribution of all cyber t scores should approach a normal distribution.
Figure 4a (red line) shows that the P values produced by the normal distribution of Equation 4 fall very close to a uniform distribution. This provides strong evidence that our assertion that the cyber t estimate of σ2 is approximately constant is reasonable. For the rest of this report, we refer to the method of generating P values from the cyber t test by assuming a normal distribution as 'cyber-t-Normal'. We emphasize that there are two differences between P values produced by cyber-t-Normal and the P values reported by the cyber t test. One is that we assume a normal distribution rather than a t distribution. The other is that we calculate the P value for each gene by comparison with a distribution of all genes on the array. That is, we assume that all the genes on the array follow a single distribution whereas the P values produced by the cyber t test are generated under the assumption that each gene follows its own independent distribution based on the Bayesian estimate of σ2 for that gene.
How close are the P values produced by the cyber-t-Normal scheme to a uniform distribution? We can use the Kolmogorov-Smirnov test to evaluate the null hypothesis that the distribution of P values from each statistic is identical to the uniform distribution of P values. The Kolmogorov-Smirnov test is a nonparametric test and can therefore suffer from low power. On the other hand, we are using the test to evaluate a distribution with over 22,000 data points, and so we are confident that even small deviations from our assumptions will produce small P values. Figure 4b,c shows the -log10 of the P value of the Kolmogorov-Smirnov test for all 14 possible 2× Latin Square comparisons. We see that, although there is considerable variability across all 14 pairs of experiments, P values produced by the cyber-t-Normal method are a good deal closer to uniform than P values produced by either the standard t or cyber t methods.
Imperfect normalization contributes to deviations from a perfectly normal distribution
Cross-hybridization also contributes to deviations from a perfectly normal distribution
Experiments consisting of technical replicates are closer to a normal distribution
Technical replicates consist of arrays that have been exposed to identical RNA. Every gene within a comparison of technical replicates is therefore a null gene. If some of the deviation from a uniform distribution in Figure 4 were caused by cross-hybridization, then we would anticipate that experiments consisting entirely of technical replicates would be closer to a uniform distribution. The sample sizes in the Latin Square experiment shown in Figure 4 are n = 3 for each condition, however, which does not allow for comparison within an experimental condition by either the cyber t or standard t test. Fortunately, a dataset with six technical replicates has been published . This dataset, which was designed to measure the effect of different RNA amplification schemes, consists of six technical replicates in each of four distinct groups for a total of 24 arrays. Within each of the four groups, there are 10 possible ways to split the six technical replicates into two groups of three. There are therefore a total of 40 distinct comparisons of technical replicates with n1 = n2 = 3 within the 24 arrays of this dataset.
'Scheme 4' should be conservative in real experiments
The graphs in Figures 4 to 7 were created using data from only null genes, which we know are not differentially expressed. In 'real' experiments, of course, we will have a mixture of null and not-null genes and we will not know which genes are null and which are differentially expressed. When we compare genes in two conditions, we assume that null genes will follow a normal distribution of scores whereas genes that are not null will not follow this same distribution. Because the majority of genes are probably null, the overall distribution of scores from a test statistic will largely reflect null genes. We measure the significance of genes as deviations from this background distribution of presumably null genes. Of course, not all of the genes will be null, and we will therefore not be able to measure and σcyberTNulls (the average and standard deviation of cyber t scores from null genes) but only and σcyberTAll, which we define as the observed mean and standard deviation of cyber t scores for all genes.
Moreover, cyber t scores will be higher for not-null genes than for null genes, and we therefore expect:
σcyberTAll > σcyberTNulls
Estimates of P values generated by Equation 4 with and σcyberTAll will therefore tend to be larger than P values that would be calculated with and σcyberTNulls from only the null genes. As more and more genes are differentially expressed between two samples, conclusions based on the P values generated by Equation 4 should therefore become more conservative.
Scheme 4 has attractive sensitivity and specificity when controlling false discovery rate
In order to compile a list of genes that are differentially expressed between conditions, one requires not only a set of P values but also some way to set a significance threshold controlling for family-wise error rate or FDR. There are a large number of reasonable choices that one could make in determining a threshold for significance [3, 4, 11, 17]. In this report, we choose to set a threshold for significance using the Benjamini and Hochberg algorithm , which is a simple and popular method for controlling FDR.
On biologic replicates, scheme 4 yields conservative, but reasonable, estimates of significant genes
Number of genes called significant at 10% false discovery rate on isogenic biologic replicates
Standard t (scheme 1)
Cyber t (scheme 2)
Although we have argued that the scheme 4 route to FDR should be conservative, given the tight correlation shown in Figure 9, it seems possible that we have over-estimated the number of genes that are truly differentially expressed. Is an assertion that there are in fact no differentially expressed genes in these experiments correct? We can take advantage of our experimental design to rule out that possibility. Of the 90 genes that are significant by scheme 4 based FDR between treatment and control (Table 1) for EB samples at 4 hours, 15 are also differentially expressed in the 87 found significant by scheme 4 in the EB samples at 24 hours. We can use the hypergeometric distribution to reject the null hypothesis that the genes found to be significant in EB samples at 4 hours are unrelated to the genes found to be significant in EB samples at 24 hours with P < 10-25. A significant fraction of the genes found to be significant with scheme 4 are therefore reproducibly differentially expressed across the 4-hour and 24-hour time points.
Likewise, of the 38 genes found to be significant by scheme 4 for MY samples at 24 hours, 15 are also differentially expressed in the 268 genes found to be significant by scheme 4 in the MY samples at 4 hours (P < 10-24). We have some evidence, therefore, that the scheme 4 route was not inappropriately anticonservative in this analysis. That is, at least some of the genes described by scheme 4 were indeed differentially expressed.
In this paper we argue that microarray statistics work best when the estimate of standard error from each gene on the array is ignored or suppressed. We are not the first group to suggest that estimates of variance from individual genes are unreliable. Previous studies have noted improved statistics when a constant is added to the variance [1, 11] or weighted by the variance from neighboring genes . We argue that if the variance from each gene is truly unknown, then it makes sense to consider all of the genes on the array as arising from a single, normal distribution. We have demonstrated that this assumption of a single normal distribution of all genes comes much closer to producing a uniform distribution of P values than does production of P values from the t distribution (Figures 4 and 7).
It is not immediately clear why algorithms, such as the standard t test, that attempt to estimate the standard error of each individual gene perform so poorly. Difficulties in accurately estimating the variance of each individual gene may arise because of the modest sample sizes in typical microarray experiments. It has also been shown that the normalization process may distort the variance of genes [20, 21]. We have seen, however, that the standard t test does much better on quantile-quantile normalized data than on non-normalized data (Figure 2), even though only about 10% of the original estimate of standard error survives normalization (Figure 3). The distortion of variance by normalization is apparently helpful. We argue that it is helpful because it replaces the original estimate of standard error for each gene with a distribution that approaches a small constant (Figure 1). This is consistent with the true variance for each gene being unknown regardless of normalization procedures.
The cyber t test uses Bayesian statistics to weigh the variance of each gene by the variance of genes with similar intensities on the array (see Materials and methods, below). As the experimental sample size increases, the weight given to the measured variance is increased while the weight given to the variance shared among similar genes is decreased. At large sample sizes, the performance of the cyber t test will therefore approach the performance of the standard t test. This behavior of the cyber t test is appropriate if the measured variance approaches the true variance as sample size increases. If, however, there are other factors at work in addition to small sample size that cause the measured variance to be unreliable, then the performance of the cyber t test may degrade as the sample size increases and the weight assigned to the background variance is therefore diminished. There is an urgent need for control datasets with larger sample sizes to determine whether the unreliability of the measured variance is primarily a function of small sample size or is somehow being caused by other aspects of microarray technology.
A recent controversy in the microarray literature has centered directly on the assumption of the uniform distribution of null P values. In analyzing a spike in dataset, Choe and coworkers  found that predicted FDRs from the SAM  algorithm appeared to be greatly anticonservative when compared with actual FDRs. In response, Dabney and Storey  noted that the anticonservative behavior of the SAM algorithm could be explained by the non-uniform distribution of P values among the non-spiked-in genes. In the Choe dataset, non-spiked-in genes had a surprising tendency to have P values too close to zero. Dabney and Storey argued that this non-uniform distribution was caused by errors in the experimental design of the spike-in dataset, a charge that was echoed somewhat by a second reanalysis of the Choe dataset . These charges have been vigorously disputed by the authors of the Choe dataset, who argue that the non-uniform distribution of P values may be a common feature of microarray data [14, 15].
Our study lends support to the arguments presented by Choe and coworkers. There are only 42 genes spiked in to the Latin Square dataset, but even this modest number of genes can produce detectable distortions in the distribution of P values among null genes (Figure 6). Given that the Choe dataset includes more than a thousand spiked-in genes, it is not surprising that the null genes in the Choe dataset have profoundly distorted P values. Moreover, the original analysis of the Choe dataset used cyber t , whereas the reanalysis used a standard t test . We have shown that both of these tests can distort the distribution of null P values (Figure 4 and 7). In their reports [13, 15], Choe and coworkers suggest multiple normalization steps as a way to avoid bias in the test statistics. We find that a second normalization step does make a small difference in producing uniform P values (Figures 4 and 7). We argue, however, that a larger difference can be made by finding a more appropriate distribution of microarray scores than the t distribution.
A problem with all microarray statistics papers is that they are dependent on the datasets analyzed. It is a constant worry that the assumptions made with regard to one dataset will not apply to new datasets in the future, that is to say that one has, in effect, constructed a statistic that is 'over-trained' to the datasets considered. The main assumption that we have made in this paper is that it reasonable to treat the standard error from each gene as a constant. This assumption appears to be reasonable for the Latin Square and technical replicate data we have examined (Figures 4 and 7). It is not, however, a perfect assumption. The distribution of P values observed in Figures 4 and 7 are not perfectly uniform. This assumption is clearly more reasonable, however, than the assumptions used to generate the P values for the standard t and cyber t tests, because P values produced by these tests are far from uniform (Figures 4 and 7). Genes in datasets that contain biologic replicates will, of course, exhibit a greater degree of variance than genes in the technical replicates that, by necessity, make up control datasets. Despite this, our assumptions appear to produce more reasonable results when applied to a 'real' biologic dataset than the assumptions of the cyber t, standard t, or SAM procedures (Table 1).
We have seen that even within the Latin Square dataset, cross-hybridization can affect probe sets that are annotated as null, distorting P values and complicating FDRs (Figure 6). Microarray experiments are prone to other artifacts, which are incompletely understood. These include saturation of probes at high signal , nonequilibrium hybridization conditions , and artifacts that arise from the dyes used in microarray experiments . A recent study found that different laboratories performing the same microarray experiment on the same RNA sample obtained large differences in their results, although the results from the best performing laboratories exhibited a greater degree of correlation . Given such a challenging environment, calculation of accurate FDRs remains a difficult proposition. We argue that because FDR calculations are liable to be distorted by subtle artifacts, one should err on the conservative side. We have taken a simple approach and shown that it is possible to generate a reasonable set of P values in a way that should become more conservative as differences increase between sets of chips. In the many cases where a conservative statistic is appropriate, we believe this approach may yield more reasonable gene lists than other currently employed methods.
Materials and methods
Implementation of statistics
The uniform distribution of P values in Figures 4, 6, and 7 was calculated as simply the inverse of the gene index. So, for example, if there were 22,000 genes in a list ordered by statistic score, then the expected P value for the first gene under a uniform distribution was 1/22,000. The expected P value for the second gene was 2/22,000 and so forth.
Background correction, quantile-quantile normalization, and RMA summary values were calculated with RMA express . In cases in which data were not normalized, background subtraction was also not performed, but RMA summary values were still generated on non-normalized data with RMA express. All RMA values are reported on a log2 scale.
The HG-U133A Latin Square dataset was downloaded from Affymetrix (Santa Clara, CA, USA) . For the Latin Square data sets, probe sets 209374_s_at, 205397_x_at, and 208010_s_at were excluded for all analyses, as instructed by the HG-U133A_tag_Latin_Square.xls spreadsheet. We also excluded any probe set not in the spike-in probe sets that started with AFFX-. This left 42 true positives and 22,182 true negatives.
Where n is the sample size (the number of arrays in the condition), SD is the standard deviation as it is usually calculated, and SDWindow is the average of the standard deviation of the 100 genes with the average intensity closest to the average intensity of the gene under consideration. The cyber t score is then calculated in the same way as the standard t test, with the SDcyberT value for each condition replacing the conventional standard deviation for each condition and an adjusted degrees of freedom of 20 + n1 + n2 - 4 (where n1 is the number of array is condition 1 and n2 is the number of arrays in condition 2). For more details, see the cyber t report  and web page .
The Benjamini and Hochberg algorithm  was implemented in Java. The predicted FDR rate for a given gene in a gene list ordered by statistic P value is given by N × p(k)/k, where N is the number of genes in the list and p(k) is the P value produced by the test statistic under the null hypothesis of no differential expression for gene k in the list.
The cdf (cumulative distribution function) function in Equation 4 was evaluated using the pnorm function in the class StatFunction.java implemented by Sundar Dorai-Raj and downloaded from the Dorai-Raj web page . This function yields equivalent values to the R function pnorm with lower.tail = FALSE.
The Kolmogorov-Smirnov test was ported to Java from the Numerical Recipes in C++ text . The Java port was tested against the ks.test method in R. In cases in which the Kolmogorov-Smirnov test returned a zero, the -log10 value was set to 200 (Figure 4) or 350 (Figures 6 and 7).
Loess regression lines were generated by the Java class Lowess. java in the package org.tigr.midas.engine distributed as part of the TIGR midas engine .
All statistics except cyber t and the results of RMA express were implemented in Java. Implementations of the equations presented in this report can be found in the supplementary materials (Additional data file 11) and at the author's web page .
Mouse embryonic stem cells were differentiated into isogenic bursting embryoid body (EB) cells or isogenic myeloid (MY) hematopoietic cells. A population of about 106 EB or MY hematopoietic cells were seeded onto five replicate dishes and expanded to obtain the appropriate number of cells per plate for treatment. About 1.5 × 107 EB or MY hematopoietic cells were exposed to either DMSO plus etoposide at a final concentration of 50 μmol/l or to DMSO (control) for 60 min. Each etoposide or DMSO control was performed on the five replicate dishes. The etoposide stock solution or DMSO (control) was diluted in Iscove's modified Dulbecco's medium supplemented with 10% non-ES-qualified fetal bovine serum. Following etoposide or DMSO (control) exposure, all samples were washed twice in 1× phosphate-buffered saline and plated in fresh medium for a recovery period of 4 or 24 hours. Following recovery all cells were washed twice in 1× phosphate-buffered saline and harvested for RNA isolation.
RNA isolation and processing for microarrays
Control and etoposide-treated cells were pelleted by centrifugation and lysed in TRIzol Reagent (Invitrogen, Carlsbad, CA, USA; 1 ml per 10 × 106 cells) by repetitive pipetting followed by incubation at room temperature for 5 min. Total RNA was recovered by phenol-chloroform extraction and isopropyl alcohol precipitation. Extracted RNA was further purified using the RNeasy mini kit (Qiagen, Valencia, CA, USA). Biotin-labeled cDNA was prepared from the purified RNA samples using the Ovation™ Biotin RNA Amplification and Labeling System (NuGEN Technologies, Inc., San Carlos, CA, USA), in accordance with the manufacturer's protocol. Briefly, first-strand and second-strand cDNA synthesis was followed by amplification of the double-stranded DNA template. Amplified cDNA was then fragmented and labeled with biotin. Biotin-labeled cDNA was purified using the DyeEx 2.0 Spin Kit (Qiagen), and product yield and purity were determined u A260, A280, and A320 spectrophotometric measurements. Fragmented, biotin-labeled cDNA (2.2 μg) from each sample was hybridized to a GeneChip Mouse Genome 430 2.0 array (Affymetrix, Inc.). The Mouse Genome 430 2.0 array contains 45,000 probe sets used to analyze the expression level of over 39,000 transcripts from over 34,000 mouse genes. Hybridization, washing, staining, and scanning of microarrays was performed by the Gene Chip Analysis Facility in the Institute for Cancer Genetics, Columbia University Health Sciences Division, New York, USA.
Additional data files
The following additional data are available with the online version of this manuscript.Raw data (in the form of 40 .cel files) for the Etoposide experiments described in Table 1 can be found in the supplementary materials in this paper (Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 compressed with bzip2). Filenames within the zip files that start with MY indicate myeloid hematopoietic cells while filenames starting with EB indicate bursting embryoid bodies. The third character of each file indicates treatment with drug + DMSO ("E") or just control DMSO ("C"). The next character of each filename indicates the replicate number. The final characters indicate the time point. So, for example, "MYE34.CEL" indicates a myeloid hematopoietic cells ("MY") treated with drug ("E"), replicate number 3, 4 hour time point. "EBC124.CEL" indicates bursting embryoid bodies ("EB"), treated with only DMSO ("C"), replicate number 1, 24 hour time point. Additional data file 11 provides implementations of the equations presented in this report in R.
We thank two anonymous reviewers for insightful comments. We thank Jonathan McCafferty and Richard Francis for technical assistance, and Ed Boyden and Alex Gordon for critical comments on an earlier version of this manuscript. Rafael A Irizarry generously provided the .CEL files for the technical replicates shown in Figure 7.
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.PubMedPubMed CentralView ArticleGoogle Scholar
- Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003, 100: 9440-9445. 10.1073/pnas.1530509100.PubMedPubMed CentralView ArticleGoogle Scholar
- Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2001, 29: 1165-1188. 10.1214/aos/1013699998.View ArticleGoogle Scholar
- Reiner A, Yekutieli D, Benjamini Y: Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics. 2003, 19: 368-375. 10.1093/bioinformatics/btf877.PubMedView ArticleGoogle Scholar
- Dabney AR, Storey JD: A reanalysis of a published Affymetrix GeneChip control dataset. Genome Biol. 2006, 7: 401-10.1186/gb-2006-7-3-401.PubMedPubMed CentralView ArticleGoogle Scholar
- Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002, S96-S104. Suppl 1Google Scholar
- Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001, 98: 31-36. 10.1073/pnas.011404098.PubMedPubMed CentralView ArticleGoogle Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264. 10.1093/biostatistics/4.2.249.PubMedView ArticleGoogle Scholar
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003, 31: e15-10.1093/nar/gng015.PubMedPubMed CentralView ArticleGoogle Scholar
- Irizarry RA, Wu Z, Jaffee HA: Comparison of Affymetrix GeneChip expression measures. Bioinformatics. 2006, 22: 789-794. 10.1093/bioinformatics/btk046.PubMedView ArticleGoogle Scholar
- Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 2006, 7: 55-65. 10.1038/nrg1749.PubMedView ArticleGoogle Scholar
- Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: regularized t test and statistical inferences of gene changes. Bioinformatics. 2001, 17: 509-519. 10.1093/bioinformatics/17.6.509.PubMedView ArticleGoogle Scholar
- Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol. 2005, 6: R16-10.1186/gb-2005-6-2-r16.PubMedPubMed CentralView ArticleGoogle Scholar
- Gaile DP.;Miecznikowski JC, Choe SE, Halfon MS: Putative Null Distributions Corresponding to Tests of Differential Expression in the Golden Spike Dataset are Intensity Dependent. Technical Report 06-01. 2006, Buffalo, NY: Department of Biostatistics, State University of New YorkGoogle Scholar
- Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Correspondence: response to Dabney and Storey. Genome Biol. 2006, 7: 401-10.1186/gb-2006-7-3-401.View ArticleGoogle Scholar
- Cope L, Hartman S, Gohlmann H, Tiesman J, Irizarry RA: Analysis of Affymetrix GeneChip Data Using Amplified RNA: Working Paper 84. 2005, Baltimore, MA: Department of Biostatistics, Johns Hopkins UniversityGoogle Scholar
- Qiu X, Klebanov L, Yakovlev A: Correlation between gene expression levels and limitations of the empirical bayes methodology for finding differentially expressed genes. Stat Appl Genet Mol Biol. 2005, 4: 34-Google Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc Ser B (Methodological). 1995, 57: 289-300.Google Scholar
- Wu C, Carta R, Zhang L: Sequence dependence of cross-hybridization on short oligo microarrays. Nucleic Acids Res. 2005, 33: e84-10.1093/nar/gni082.PubMedPubMed CentralView ArticleGoogle Scholar
- Parrish RS, Spencer HJ: Effect of normalization on significance testing for oligonucleotide microarrays. J Biopharm Stat. 2004, 14: 575-589. 10.1081/BIP-200025650.PubMedView ArticleGoogle Scholar
- Hoffmann R, Seidl T, Dugas M: Profound effect of normalization on detection of differentially expressed genes in oligonucleotide microarray data analysis. Genome Biol. 2002, 3: research0033.1-research0033.11. 10.1186/gb-2002-3-7-research0033.Google Scholar
- Irizarry RA, Cope L, Wu Z: Feature-level exploration of the Choe et al. Affymetrix Genechip control dataset (Working Paper 102). [http://www.bepress.com/jhubiostat/paper102]
- Naef F, Socci ND, Magnasco M: A study of accuracy and precision in oligonucleotide arrays: extracting more signal at large concentrations. Bioinformatics. 2003, 19: 178-184. 10.1093/bioinformatics/19.2.178.PubMedView ArticleGoogle Scholar
- Sartor M, Schwanekamp J, Halbleib D, Mohamed I, Karyala S, Medvedovic M, Tomlinson CR: Microarray results improve significantly as hybridization approaches equilibrium. Biotechniques. 2004, 36: 790-796.PubMedGoogle Scholar
- Naef F, Magnasco MO: Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys Rev E Stat Nonlin Soft Matter Phys. 2003, 68: 011906-PubMedView ArticleGoogle Scholar
- Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, Frank BC, Gabrielson E, Garcia JG, Geoghegan J, Germino G, et al: Multiple-laboratory comparison of microarray platforms. Nat Methods. 2005, 2: 345-350. 10.1038/nmeth756.PubMedView ArticleGoogle Scholar
- RMA Express. [http://rmaexpress.bmbolstad.com/]
- Affymetrix Latin Square Data. [http://www.affymetrix.com/support/technical/sample_data/datasets.affx]
- Cyber-T. [http://visitor.ics.uci.edu/cgi-bin/genex/cybert/CyberTReg-8.0.form.pl]
- Saeed AI, Bhagabati NK, Braisted JC, Liang W, Sharov V, Howe EA, Li J, Thiagarajan M, White JA, Quackenbush J: TM4 Microarray Software Suite. Methods Enzymol. 2006, 411: 134-193.PubMedView ArticleGoogle Scholar
- Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, et al: TM4: a free, open-source system for microarray data management and analysis. Biotechniques. 2003, 34: 374-378.PubMedGoogle Scholar
- TM4 Microarray Software Suite. [http://www.tm4.org/mev.html]
- The Virtual Home of Sundar Dorai-Raj. [http://www.stat.vt.edu/~sundar/java/code/StatFunctions.html]
- Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in C++: The Art of Scientific Computing. 2002, Cambridge, UK: Cambridge University PressGoogle Scholar
- Anthony Fodor's home page. [http://www.afodor.net]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.