A reanalysis of a published Affymetrix GeneChip control dataset
© BioMed Central Ltd 2006
Published: 22 March 2006
A response to Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset by SE Choe, M Boutros, AM Michelson, GM Church and MS Halfon. Genome Biology 2005, 6:R16.
In a recent Genome Biology article, Choe et al. [1] described a control dataset for Affymetrix GeneChips. Because RNA is spiked in at known quantities, the identities of all null and differentially expressed genes are known exactly, as are the fold changes of differential expression. Given the wealth of analysis methods available for microarray data, such a control dataset would be very useful. Unfortunately, serious errors are evident in the Choe et al. data, invalidating their conclusions and implying that the dataset cannot be used to validly evaluate statistical inference methods. We argue that the problems in the dataset are at least partially due to a flaw in the experimental design.
q-value estimates incorrectly criticized
Figure 8 in Choe et al. [1] suggests that estimated q-values substantially underestimate the true q-values. For example, Figure 8b shows that, when a q-value cutoff of 0.20 is used, the true q-value is closer to 0.60. This reflects a serious error somewhere in the analysis; the question is whether the error occurs prior to, or as a result of, the q-value calculations.
False discovery rates (FDRs) were originally proposed by Soric [2] and Benjamini and Hochberg [3]. The q-value was developed as the FDR analog of the p-value [4–7]. There is sound statistical justification behind both FDR and q-value methods; that is, there is rigorous mathematical proof of their claimed operating characteristics. For example, conditions have been detailed under which the q-value estimates are guaranteed to be (simultaneously) conservative [5, 7]. This means that, if a gene is assigned an estimated q-value of 0.05, then its true, population-average q-value is no larger than 0.05.
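To make the construction concrete, here is a minimal sketch (ours, not taken from the papers cited above) of the simplest step-up q-value estimate, with the null proportion pi0 fixed at 1. A small simulation with known null and alternative tests shows the realized false discovery proportion staying near the nominal target:

```python
# Minimal q-value sketch (Benjamini-Hochberg step-up form, pi0 = 1).
# This is an illustration only, not the full methodology of [4-7].
import numpy as np

def q_values(pvals):
    """q[i] = min over ranks j >= rank(i) of m * p_(j) / j."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity by taking running minima from the largest p-value
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = q_sorted
    return q

rng = np.random.default_rng(0)
# 9000 true nulls (uniform p-values) plus 1000 alternatives (small p-values)
null_p = rng.uniform(size=9000)
alt_p = rng.beta(0.1, 10, size=1000)
p = np.concatenate([null_p, alt_p])
is_null = np.concatenate([np.ones(9000, bool), np.zeros(1000, bool)])

called = q_values(p) <= 0.05
true_fdp = is_null[called].mean() if called.any() else 0.0
print(f"calls: {called.sum()}, realized false discovery proportion: {true_fdp:.3f}")
```

Because the null p-values here really are uniform, the realized false discovery proportion among the called genes sits near or below the 0.05 threshold, which is exactly the guarantee that fails in the Choe et al. data.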
Note that the q-value methodology employed by Choe et al. [1] was credited to the SAM method proposed by Tusher et al. [8], even though the q-value methodology was developed separately from the SAM method in references [4–7]. The SAM software (as distinct from the SAM method [8]) is based on at least four different articles [4, 8–10], each of which contributes unique methodology. Failing to differentiate between the SAM method of Tusher et al. and the SAM software seems to have caused a good deal of confusion in the literature when evaluating and understanding the operating characteristics of the methods in the software.
We now show that the fundamental requirements for employing q-values (and even p-values) are not satisfied by Choe et al.'s dataset, leading to spurious conclusions. These requirements are stated in each of the original papers detailing the q-value methodology [4–7]. In particular, the q-value methodology draws on the fact that the p-values for null genes should be uniformly distributed on the interval (0,1). As stated by Storey and Tibshirani [5]: "If the null p-values are not uniformly distributed, then one wants to err in the direction of overestimating p-values (that is, underestimating significance). Correctly calculated p-values are an important assumption underlying our methodology." This is not a special requirement for q-values; it is a requirement for any standard significance criterion.
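The dependence on uniform null p-values is easy to see in the pi0 estimate, pi0_hat = #{p > lambda} / (m(1 - lambda)), used by the q-value methodology. The simulation below (ours, a deliberately simplified illustration) computes correct two-sided t-test p-values for genes that are all truly null, and then the p-values one would get if the variance were underestimated four-fold; only the first case recovers the true pi0 of 1:

```python
# Illustration (not from the correspondence): Storey's pi0 estimate
# relies on null p-values being Uniform(0,1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m, lam = 10000, 0.5

# Case 1: correctly calculated null p-values -- two-sample t-tests
# on genes with no true difference (n = 3 per group, df = 4).
x = rng.normal(size=(m, 3))
y = rng.normal(size=(m, 3))
t, p_good = stats.ttest_ind(x, y, axis=1)

# Case 2: the same t statistics inflated two-fold (as happens when the
# variance is underestimated four-fold); p-values pile up near zero.
p_bad = 2 * stats.t.sf(np.abs(2 * t), df=4)

for name, pv in [("uniform nulls", p_good), ("skewed nulls", p_bad)]:
    pi0_hat = (pv > lam).mean() / (1 - lam)
    print(f"{name}: pi0_hat = {pi0_hat:.2f}")
# Every test here is null (true pi0 = 1); the skewed case underestimates it.
```

When pi0 is underestimated in this way, the resulting q-values are anti-conservative, which is the qualitative pattern seen in Figure 8 of Choe et al. [1].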
Problems with the experimental design
The question remains as to the cause of the p-values being incorrect. One source of the problem is the experimental design itself. Consider Figure 1 in Choe et al. [1]. In column three, there are three aliquots of labeled cRNA clones. Each of these aliquots is divided in half, and one half is then spiked with known concentrations of RNA for the genes selected to be differentially expressed. Because random variability is introduced at the spike-in step (between columns 3 and 4 in Figure 1 of Choe et al. [1]), even null genes will show some differences in RNA amount. That is, both halves of each aliquot undergo some manipulation (even the control half), introducing random variation into the RNA amounts of all genes. This leaves three matched pairs of independent samples, with some within-pair variation for all genes.
A major flaw occurs when the three samples from each treatment (control or spike-in) are combined into two aliquots in column 4. Now, instead of three independent matched pairs, there is only a single matched pair in column 4. Each half of this single matched pair is hybridized to three chips. Therefore, the random variation introduced at the spike-in stage will not be detectable among the six resulting chips. If every chip is treated as an independent observation, then the variation introduced in the previous step among null genes will appear to be true signal. Unfortunately, this is exactly what Choe et al. do, leading to the incorrect p-values that we observed earlier. This problem cannot be fixed by modifying the statistics.
Consider the following scenario, which suffers from the same problem as Choe et al.'s design but is phrased in more familiar terminology. Suppose we are interested in differences in gene expression for two biological groups under two conditions, A and B. To this end, we obtain expression measurements for a single individual (that is, a single biological replicate) under both conditions. On the basis of the sample from this single individual, we then form three replicated chips for each condition.
By chance, there will be small differences in the expression measurements under both conditions A and B. With a proper estimate of the variability, these differences will be identified as being random and the associated tests called null. The problem is that we cannot use our single individual to estimate the true variability of expression measurements under conditions A and B, taking into account all sources of variation. If we treat the six replicated observations as three independent matched pairs, then our estimated variance is due only to random aspects of the hybridization process. This will significantly underestimate the true variances, and, as a result, we will grossly inflate the significance of differential expression - even among null genes.
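The consequence of this pseudo-replication is easy to demonstrate numerically. The toy simulation below (ours, with arbitrary noise scales chosen for illustration) gives a single matched pair a one-time dose of "spike-step" noise in each half, then measures each half on three chips with much smaller hybridization noise. Every gene is null, yet t-tests across the technical replicates flag most of them:

```python
# Toy simulation of the design flaw described above: one-time noise
# shared by all chips in a half is mistaken for true signal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
m = 5000                        # null genes only: no true differences
mu = rng.normal(8, 1, size=m)   # common true expression in both halves

# random variation introduced once, at the spike-in/handling step,
# independently in the control and spike halves (sd = 0.3, assumed)
step_noise = rng.normal(0, 0.3, size=(m, 2))

# three technical replicates (chips) per half; hybridization noise is
# much smaller (sd = 0.05, assumed), so it dominates the error estimate
chips = (mu[:, None, None] + step_noise[:, :, None]
         + rng.normal(0, 0.05, size=(m, 2, 3)))

t, p = stats.ttest_ind(chips[:, 0], chips[:, 1], axis=1)
print(f"null genes with p < 0.05: {(p < 0.05).mean():.0%}")
# Far above the nominal 5%: the within-half variance cannot "see" the
# one-time step noise, so the t-tests grossly overstate significance.
```

The same code with three independent matched pairs (step noise drawn per chip pair rather than once) would restore roughly 5% of nulls at p < 0.05, which is why the flaw cannot be repaired by changing the test statistic.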
A large-scale "spike-in" control dataset would be invaluable for head-to-head comparisons of statistical methods for microarrays. The Choe et al. dataset was intended to serve this purpose. Unfortunately, the dataset is flawed in that even the null genes appear to be differentially expressed. As a result, the dataset cannot be relied upon for evaluating statistical inference methods. We note further that, when applying statistical methods such as the q-value estimates, one must take care to ensure that all necessary assumptions are met.
Sung E Choe, Michael Boutros, Alan M Michelson, George M Church and Marc S Halfon respond:
One of the main purposes of our paper [1] was to challenge the community to improve on existing microarray analysis methods and to promote a better understanding of the experimental conditions for which these methods are appropriate. Dabney and Storey make a valuable contribution to this effort by clarifying why our q-value calculations underestimate the false discovery rate: the underlying p-value distributions for the "null genes" - those with no expected fold change between the spike (S) and control (C) samples - are non-uniform in each of the 10 best datasets considered in our original manuscript [1]. Dabney and Storey also provide simulation results to suggest that the problem is due to our use of technical replicates. However, we demonstrate here that this aspect of our design is sound, and that their model is consistent neither with our actual experimental design nor with the observed distribution of the null p-values. We also offer a more likely explanation for the problems that they note.
The preceding analysis suggests that the observed non-uniformity of the p-values is not easily explained by a simple model. Although relevant mainly to just one aspect of our study (regarding q-value estimates), this is an important and complex issue in need of further investigation, and we are grateful to Dabney and Storey for bringing it to our attention. Whether these non-uniform p-values are manifest in other datasets or are merely a byproduct of the unbalanced signal strengths present in our experiment also remains an open question to be addressed by future studies. However, in most microarray experiments it is impossible to check if the null p-values are uniformly distributed (as the true set of null genes is not known) and it is therefore impossible to determine whether or not the requirements for accurate estimation of q-values are met. We therefore caution that the preprocessing issues seen here might be relevant in other microarray experiments. We can easily conceive of biological conditions in which imbalances similar to those in our original study could exist - for example, when comparing different tissue types, in certain developmental time courses, or in cases of immune challenge.
Correspondence should be sent to Marc S Halfon: Department of Biochemistry and Center of Excellence in Bioinformatics and the Life Sciences, State University of New York at Buffalo, Buffalo, NY 14214, USA. E-mail: firstname.lastname@example.org
Additional data files
Additional data file 1 available with this paper provides details of the simulations reported here.
We are indebted to Daniel Gaile and Jeffrey Miecznikowski of the University at Buffalo Department of Biostatistics for the analysis presented in Figures 6-8 and for their permission to use material from their technical report [15] in this response.
1. Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol 2005, 6:R16. doi:10.1186/gb-2005-6-2-r16.
2. Soric B: Statistical discoveries and effect-size estimation. J Am Stat Assoc 1989, 84:608-610.
3. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 1995, 57:289-300.
4. Storey JD: A direct approach to false discovery rates. J R Stat Soc Ser B 2002, 64:479-498. doi:10.1111/1467-9868.00346.
5. Storey JD, Tibshirani R: Statistical significance for genome-wide studies. Proc Natl Acad Sci USA 2003, 100:9440-9445. doi:10.1073/pnas.1530509100.
6. Storey JD: The positive false discovery rate: a Bayesian interpretation and the q-value. Ann Stat 2003, 31:2013-2035. doi:10.1214/aos/1074290335.
7. Storey JD, Taylor JE, Siegmund D: Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: a unified approach. J R Stat Soc Ser B 2004, 66:187-205. doi:10.1111/j.1467-9868.2004.00439.x.
8. Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, 98:5116-5124. doi:10.1073/pnas.091062498.
9. Efron B, Tibshirani R, Storey JD, Tusher V: Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 2001, 96:1151-1160. doi:10.1198/016214501753382129.
10. Taylor J, Tibshirani R, Efron B: The miss rate for the analysis of gene expression data. Biostatistics 2005, 6:111-117. doi:10.1093/biostatistics/kxh021.
11. Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments. Stat Sci 2003, 18:71-103. doi:10.1214/ss/1056397487.
12. Lehmann EL: Testing Statistical Hypotheses. Berlin: Springer; 1997.
13. Rice JA: Mathematical Statistics and Data Analysis. 2nd edition. Belmont, CA: Duxbury Press; 1995.
14. Affymetrix: GeneChip Expression Analysis: Technical Manual. Santa Clara, CA: Affymetrix; 2004.
15. Gaile DP, Miecznikowski JC, Choe SE, Halfon MS: Putative null distributions corresponding to tests of differential expression in the Golden Spike dataset are intensity dependent. Technical report 06-01. Buffalo, NY: Department of Biostatistics, State University of New York; 2006. [http://sphhp.buffalo.edu/biostat/research/techreports/UB_Biostatistics_TR0601.pdf]