Statistical methods for ranking differentially expressed genes
© Broberg; licensee BioMed Central Ltd. 2003
Received: 9 December 2002
Accepted: 7 May 2003
Published: 29 May 2003
In the analysis of microarray data the identification of differential expression is paramount. Here I outline a method for finding an optimal test statistic with which to rank genes with respect to differential expression. Tests of the method show that it allows generation of top gene lists that give few false positives and few false negatives. Estimation of the false-negative as well as the false-positive rate lies at the heart of the method.
Microarray technology has revolutionized modern biological research by permitting the simultaneous study of genes comprising a large part of the genome. The blessings stemming from this are accompanied by the curse of high dimensionality of the data output. The main objective of this article is to explore one method for ranking genes in order of likelihood of being differentially expressed. Top gene lists, that give few false positives and few false negatives, are the output. As the interest is mainly in ranking for the purpose of generating top gene lists, issues such as calculation of p-values and correction for multiple tests are of secondary importance.
Microarrays have an important role in finding novel drug targets; the thinking that guides the design and interpretation of such experiments has been expressed by Lonnstedt and Speed : "The number of genes selected would depend on the size, aim, background and follow-up plans of the experiment." Often, interest is restricted to so-called 'druggable' target classes, thus thinning out the set of eligible genes considerably. It is generally sensible to focus attention first on druggable targets with smaller p-values (where the p-value is the probability of obtaining at least the same degree of differential expression by pure chance) before proceeding to ones with larger p-values. In general, p-values have the greatest impact on decisions regarding target selection by providing a preliminary ranking of the genes. This is not to say that multiplicity should never be taken into account, or that the method presented here replaces correction for multiplicity. On the contrary, the approach provides a basis for such calculations (see Additional data files).
The approach presented here could be applied to different types of test statistics, but one particular type of recently proposed statistic will be used. In Tusher  a methodology based on a modified t-statistic is described:
It is more relevant to optimize with respect to false-positive and false-negative rates. This is the basic idea behind the new approach. The idea is to jointly minimize the number of genes that are falsely declared positive and the number of genes falsely declared negative by optimizing over a range of values of the significance level a and the fudge constant S0. How well this is achieved can be judged by a receiver operating characteristics (ROC) curve, which displays the number of false positives against the number of false negatives expressed as proportions of the total number of genes.
An alternative to the statistic (1) is d = diff/√(S02 + S2), or d = diff/√(wS02 + (1 - w)S2) for some weight w, which is basically the statistic proposed in Baldi . Its performance appears to be very similar to that of (1) (data not shown). A software implementation in R code within the package SAG [5, 6] is available from  via the function samroc.
False discovery rate (FDR) may be defined as the proportion of false positives among the significant genes, see . False-positive rate (FP) may be defined as the number of false positives among the significant genes divided by the total number of genes. Similarly, we define the false-negative rate (FN) as the number of false negatives among the nonsignificant genes divided by the total number of genes, the true-positive rate (TP) as the number of true positives divided by the total number of genes, and, the true-negative rate (TN) as the number of true negatives divided by the total number of genes.
The unknown distribution of true and false positives and negatives
1 - p(α)
Assume that we can, for every combination of values of the significance level α and the fudge constant S0, calculate (FP, FN). The goodness criterion is then formulated in terms of the distance of the points (FP, FN) to the origin (which point corresponds to no false positives and no false negatives, see Figure 2), which in mathematical symbols may be put as
The optimal value of (α, S0) will be the one that minimizes (2). It is for practical reasons not possible to do this minimization over every combination, so the suggestion is to estimate the criterion over a lattice of (α, S0) values and pick the best combination.
If one has an assessment regarding the relative importance of FP and FN, that may be reflected in a version of the criterion (2) that incorporates a weight λ that reflects the relative importance of FP compared to FN: Cλ = √(λ2 FP2 + FN2). The choice λ = (1 - p0)/p0 corresponds to another type of ROC curve, which displays the proportion of true (TP/(1 - p0)) against the proportion of false (FP/p0) (see Additional data files). Other goodness criteria are possible, such as the sum of FP and FN or the area under the curve in Figure 2. For more details and other approaches see, for example [8, 9].
Using the permutation method to simulate the null distribution (no change) we can obtain a p-value for a two-sided test, as detailed below. Loosely speaking, in each loop of the simulation algorithm the group labels are randomly rearranged, so that random groups are formed, the test statistic is calculated for this arrangement and the value is compared to the observed one. How extreme the observed test statistic is will be judged by counting the number of times that more extreme values are obtained from the null distribution.
The data matrix Xhas genes in rows and arrays in columns. Consider the vector of group labels fixed. The permutation method consists of repeatedly permuting the columns (equivalent to rearranging group labels), thus obtaining the matrix X*, and calculating the test statistic for each gene and each permutation. Let d(j)*kbe the value of the statistic of the jth gene in the kth permutation of columns. Then the p-value for gene i equals
where M is the number of genes, d(i) the observed statistic for gene i, B the number of permutations and '#' denotes the cardinality of the set [2, 10, 11]. In words, this gives the relative frequency of randomly generated test statistics with an absolute value that exceeds the observed value of gene i. The formula (3) combines the permutation method in  and the p-value calculation in . These p-values are such that a more extreme value of the test statistic will yield a lower p-value.
Given the significance level α (p-values less than α are considered significant), the proportion of the genes considered differentially expressed is
which is the relative frequency of genes with a p-value less than α.
The current version of samroc uses the estimate
where q X is the X% percentile of the d* (compare ). This estimate makes use of the fact that the genes whose test statistics fall in the quartile range will be predominantly the unchanged ones. More material on this matter is in the Additional data files.
which is the proportion of unchanged genes multiplied by the probability that such a gene produces a significant result. For a derivation see the Additional data files.
From Table 1 one obtains, as outlined in the Additional data files,
FN = 1 - p0(1 - α) - p(α) (6)
To get an intuitive feel for this equality, just note that the second term is the proportion unchanged multiplied by the probability of such genes not being significant, which estimates TN, and that the third term corresponds to the positive (TP + FP). Subtracting the proportion of these two categories from the whole will leave us with the FN.
Estimating the criterion
The entities we need for the optimisation are given by the estimates
A scatter plot of the estimate of the criterion
A detailed account of the results is given in the Additional data files, where datasets, data preprocessing, analysis and results are described in enough detail to enable the results to be reproduced.
When testing methods in this field it is difficult to find suitable data for which something is known about the true status of the genes. If one chooses to simulate, then the distributions may not be entirely representative of a real-life situation. If one can find non-proprietary real-life data, then the knowledge as to which genes are truly changed may be uncertain. To provide adequate evidence of good performance it is necessary to provide such evidence under different and reproducible conditions.
In the comparison, samroc, t-test, Wilcoxon, the Bayesian method in , and SAM  were competing. By the t-test I mean the unequal variance t-test: t = (mean1 - mean2)/√(s12/n1 + s22/n2) for sample means mean1 and mean2 and sample variances s12, s22. The Wilcoxon rank sum test is based on the sum W s of the ranks of the observations in one of the groups W s = R1...+ Rn1 . The Bayesian method calculates the posterior odds for genes being changed (available as functions stat.bay.est.lin in the R package SAG, and stat.bayesian in the R package sma [1, 5]). The method starts from the assumption of a joint a priori distribution of the effect estimate and the standard error. The former is assumed normal and the latter inverse Gamma.
Simulated cDNA data
The normal distributions simulated defined by their means and standard deviations
Oligonucleotide leukemia data
Ranking based on the leukemia data 
Affymetrix spiking experiment data
Affymetrix spike experiment
M, N, O, P
Q, R, S, T
Whether to look at data on a log scale or not is a tricky question, and is beyond the scope of this article. However, the best performance by the tests considered was achieved when data were lognormal (see Additional data files). Normal, lognormal and real-life data were all included in order to supply a varied testing ground.
As pointed out in , the Bayes statistic is for ranking purposes equivalent to a penalized t-statistic t p = (mean1 - mean2)/√(a1 + S2). Here a1 is a scale parameter related to the a priori distribution of the standard error. This means that it is, at least in form, closely related to the t-test, SAM and samroc. SAM, on the other hand, chooses as its fudge constant the value among the percentiles of S, which minimizes the coefficient of variation of the median absolute deviation of the test statistic computed over a number of percentiles of S . It is interesting to note how different the three related statistics the Bayes method, SAM and samroc turn out in practice.
One clue to why this difference occurs emerges when comparing the denominators of SAM/samroc and Bayes more closely. First square the denominators of (1) and the representation of Bayes above. We obtain (a + S)2 = a2 + 2aS + S2 for (1) and a1 + S2 for Bayes (where generally a1 ≥ a2). For large values of S the former will exceed the latter. This means that SAM/samroc will downplay the importance of the results for high expressing genes in a way that the Bayes method does not.
But there is also another difference. The Bayes method seems to achieve best when the number of false positives is allowed to grow rather large. The constant a corresponds to a large percentile in the distribution of the S2 values (see Additional data files). Whereas the constant in SAM will generally be rather small, often the 5-10% percentile of the S values, the constant in the Bayes method will correspond to at least the 40% percentile of the S2 values. It seems that using a large percentile will give a good performance when the number of false positives grows large. This observation is consistent with the observation made in Lonnstedt and Speed  that the particular version of SAM, which always uses the 90% percentile, will pass the Bayes method when the number of false positives is allowed to grow large. Also, samroc will in general make use of a smaller percentile, albeit that samroc shows greater spread between datasets in the values chosen, as a result of its adaptation to the features specific to the data at hand.
Samroc is the only method that makes explicit use of the number of changed genes in the ranking. If one has reason to believe, for example from studying expression (3), that there are very few DEGs (<< 1%), then samroc is probably not the first choice. Probably SAM or the Bayes method is more useful in these situations. If, on the other hand, the number of DEGs is reasonably large, samroc is conjectured to take precedence over SAM, and to be more robust than the Bayes method. Furthermore, one can argue that the kind of experiments undertaken in drug discovery would more often than not end in comparisons in which the biological systems show vast differences in a large number of genes, mostly as a downstream effect of some shock to the system.
The proposed method comes out better than or as good as the original SAM statistic in most tests performed. The samroc statistic is robust and flexible in that it can address all sorts of problems that suit a linear model. The methodology adjusts the fudge constant flexibly and achieves an improved performance. The algorithm gives fewer false positives and fewer false negatives in many situations, and was never much worse than the best test statistic in any circumstance. However, a typical run with real-life data will take several hours on a desktop computer. To make this methodology better suited for production it would be a good investment to translate part of the R code, or the whole of it, into C.
To improve on standard univariate tests one must make use of the fact that data are available on a large number of related tests. One way of achieving this goal has been shown in this paper. The conclusion is that it is possible and sensible to calibrate the test with respect to estimates of the false-positive and false-negative rates.
Additional data files
A zip file (Additional data file 1) containing the R package SAG for retrieval, preparation and analysis of data from the Affymetrix GeneChip and the R script (Additional data file 2) are available with the online version of this article. An appendix (Additional data file 3) giving further details of the statistical methods and the samroc algorithm is also available as a PDF file.
Ingrid Lonnstedt generously made the code to a linear models version of stat.bay.est available. Comments from Terry Speed are gratefully acknowledged, among them the suggestion to use spike data and ROC curves in the evaluation, as is the input from Brian Middleton and Witte Koopmann.
- Lonnstedt I, Speed TP: Replicated microarray data. Stat Sinica. 2002, 12: 31-46.Google Scholar
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.PubMedPubMed CentralView ArticleGoogle Scholar
- Chu G, Narasimhan B, Tibshirani R, Tusher VG: SAM Version 1.12: user's guide and technical document. [http://www-stat.stanford.edu/~tibs/SAM/]
- Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics. 2001, 17: 509-519. 10.1093/bioinformatics/17.6.509.PubMedView ArticleGoogle Scholar
- The Comprehensive R Archive Network. [http://www.cran.r-project.org]
- Ihaka R, Gentleman R: R: a language for data analysis and graphics. J Comput Graph Stat. 1996, 5: 299-314.Google Scholar
- Supplementary files: SAG and simulation script. [http://home.swipnet.se/pibroberg]
- Lovell DR, Dance CR, Niranjan M, Prager RW, Dalton KJ: Ranking the effect of different features on the classification of discrete valued data. In Engineering Applications of Neural Networks. 1996, Kingston on Thames, London, 487-494.Google Scholar
- Genovese C, Wasserman L: Operating characteristics of the FDR procedure. 2001, Technical Report. New York, Carnegie Mellon UniversityGoogle Scholar
- Dudoit S, Yang YH, Speed TP, Callow MJ: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sinica. 2002, 12: 111-140.Google Scholar
- Davison AC, Hinkley DV: Bootstrap Methods and their Application. 1997, Cambridge, UK: Cambridge University PressView ArticleGoogle Scholar
- Storey JD: A direct approach to false discovery rates. 2001, Technical Report. Stanford, CA: Stanford UniversityGoogle Scholar
- Efron B, Tibshirani R, Storey JD, Tusher VG: Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc. 2001, 96: 1151-1160. 10.1198/016214501753382129.View ArticleGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995, 57: 963-971.Google Scholar
- Lehmann EL: Nonparametrics: Statistical Methods Based on Ranks. 1975, San Francisco, CA: Holden-DayGoogle Scholar
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.PubMedView ArticleGoogle Scholar
- Whitehead Institute Center for Genome Research: Cancer Genomics Publications Data Sets. [http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi]
- Speed Group Microarray Page - Affymetrix data analysis. [http://www.stat.berkeley.edu/users/terry/zarray/Affy]
- Affymetrix. [http://www.affymetrix.com]
- Smyth GK, Yang YH, Speed TP: Statistical issues in cDNA microarray data analysis. [http://www.stat.berkeley.edu/users/terry/zarray/Html/papersindex.html]
- Bioconductor software for bioinformatics. [http://www.bioconductor.org]
- Callow MJ, Dudoit S, Gong EL, Speed TP, Rubin EM: Microarray expression profiling identifies genes with altered expression in hdl-deficient mice. Genome Res. 2000, 10: 2022-2029. 10.1101/gr.10.12.2022.PubMedPubMed CentralView ArticleGoogle Scholar
- Arfin SM, Long AD, Ito T, Tolleri L, Riehle MM, Paegle ES, Hatfield GW: Global gene expression profiling in Escherichia coli K12: the effect of integration host factor. J Biol Chem. 2000, 275: 29672-29684. 10.1074/jbc.M002247200.PubMedView ArticleGoogle Scholar
- Lehmann EL: Testing Statistical Hypothesis. 1959, New York: WileyGoogle Scholar
- Pan W: A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics. 2002, 18: 546-554. 10.1093/bioinformatics/18.4.546.PubMedView ArticleGoogle Scholar
- Irizarry RA, Hobbs B, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. [http://www.stat.berkeley.edu/users/terry/zarray/Affy/GL_Workshop/genelogic2001.html]
- DNA-Chip Analyzer (dChip). [http://www.dchip.org]
- Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci. 2001, 98: 31-36. 10.1073/pnas.011404098.PubMedPubMed CentralView ArticleGoogle Scholar
- Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW: On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol. 2001, 8: 37-52. 10.1089/106652701300099074.PubMedView ArticleGoogle Scholar
- Tsodikov A, Szabo A, Jones D: Adjustments and measures of differential expression for microarray data. Bioinformatics. 2002, 18: 251-260. 10.1093/bioinformatics/18.2.251.PubMedView ArticleGoogle Scholar
- Feller W: An Introduction to Probability Theory and Its Applications. 1971, New York: Wiley, 2: 2Google Scholar
- SMAWEHI: An R Library for statistical microarray analysis. [http://bioinf.wehi.edu.au/smawehi/index.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.