 Deposited research article
 Published:
Ranking genes with respect to differential expression
Genome Biology volume 3, Article number: preprint0007.1 (2002)
Abstract
Background
In the pharmaceutical industry and in academia substantial efforts are made to make the best use of the promising microarray technology. The data generated by microarrays are more complex than most other biological data attracting much attention at this point. A method for finding an optimal test statistic with which to rank genes with respect to differential expression is outlined and tested. At the heart of the method lies an estimate of the false negative and false positive rates. Both investing in false positives and missing true positives lead to a waste of resources. The procedure sets out to minimise these errors. For calculation of the false positive and negative rates a simulation procedure is invoked.
Results
The method outperforms commonly used alternatives when applied to simulated data modelled after real cDNA array data as well as when applied to real oligonucleotide array data. In both cases the method comes out as the overall winner. The simulated data are analysed both exponentiated and on the original scale, thus providing evidence of the ability to cope with normal and lognormal distributions. In the case of the real life data it is shown that the proposed method will tend to push the differentially expressed genes higher up on a test statistic based ranking list than the competitors.
Conclusions
The approach of making use of information concerning both the false positive and false negative rates in the inference adds a useful tool to the toolbox available to scientists in functional genomics.
Background
The microarray technology has revolutionized modern biological research by permitting the simultaneous study of a great part of the genome. The blessings stemming from this also brings the curse of high dimensionality of the data output. Microarrays play an important role in finding drug targets. This application provides the primary practical motivation for the method presented.
The main objective of this article is to explore one method for ranking genes in order of likelihood of being differentially expressed. Since it is the ranking that is of main interest, issues such as calculation of pvalues and correction for multiple tests play a secondary role. Rather it is the thinking expressed in [1] that guide much drug target identification research: "The number of genes selected would depend on the size, aim, background and followup plans of the experiment". Often interest is restricted to some socalled drugable class of targets, thus thinning out the set eligible genes considerably. Furthermore, the fact that a group of genes in the same functional class show evidence of differential expression may be much more telling than the individual pvalues as such. However the rank order does play a role: It is generally sensible to validate a target in the drugable class with a smaller pvalue prior to proceeding to one with a larger pvalue. Sometimes it is possible to be guided by the performance of known drug targets in the choice of cutoff, but at any rate pvalues have the greatest impact on decisions by providing a preliminary ranking of the genes. This is not to say that one should never take multiplicity into account or that this method in some way replaces correction for multiplicity. On the contrary samroc provides the basis for such calculations, see Section Samroc.
The approach presented could be applied to different types of test statistics, but to fix ideas one particular type recently proposed will be used. In the [2] a methodology based on a regularised tstatistic is described:
where diff is an effect estimate, e.g. a group mean difference, S is a standard error, and S_{0} is the regularising constant. This formulation is quite general and e.g. includes the estimation of a contrast in an ANOVA. Putting S_{0} = 0 will yield a tstatistic. The constant is found by removing the trend in d as a function of S in moving windows across the data. The technical details are spelled out in [3]. The statistic calculated this way will be referred to as SAM.
The basic idea with d is to eliminate some false positives with low values on S. It seems more relevant to optimise with respect to what is really the issue, namely the false positive and false negative rates. This is the intuition behind the approach.
An alternative to the statistic (1) is
or for some weight w, which is basically the statistic proposed in [4]. Its performance appears to be very similar to that of (1) (data not shown). A more imaginative, but rather unorthodox, approach to comparing two groups would be to use m = Σe^{aXi}/n, where X_{ i } is assumed positive, for a fold change calculation of the measures m_{1} and m_{2} for the two groups: MG = max{m_{1}, m_{2}}/min{m_{1}, m_{2}}. The statistic m estimates the moment generating function. The case a = 1 will be evaluated.
In this article another goodness criterion is proposed and argued. What is really relevant in choosing a good method is choosing one with an attractive Receiver Operating Characteristics (ROC) curve. By an attractive ROC curve is meant a plot of Proportion False Positives against Proportion False Negative that is as close to the axes as possible. This will minimise the number of genes that are falsely declared positive and falsely declared negative for a given significance level α and value on S_{0}. It will also minimise the distance to the origin, which will be the criterion for finding an optimal value on S_{0} as well as on α, see Fig. 2.
A software implementation in R code [5,6] is available at the supplementary web page [7]. The R package SAG contains the function samroc, which provides an implementation of the method.
The structure of the article is the following. First the criterion is explained in detail. Then the estimation problems concerning the false positive and false negative rates are solved. Finally, the algorithm is outlined and tested on some simulated and some real data.
Methods
The criterion
A comparison of methods in terms of their ROC curves is displayed in Figure 4 of [1]. There a method whose ROC curve lies below another one is preferred, see Figure 2. If we agree that it be sensible to compare methods with respect to their ROC curves, then estimation procedures ought to find parameter estimates that make the ROC curve optimal in some sense. This section suggests a goodness criterion for the ROC curve.
By False Discovery Rate (FDR) we mean the proportion of false positives among the significant genes, seer e.g. [2]. Multiplying FDR by the proportion of genes that are declared significant and dividing by the number of genes we obtain the False Positive rate (FP). Similarly we define the False Negative rate (FN).
Assume that we can, for given significance level α, estimate FP(α) and FN(α), and let h be a spline approximating the regression of FP on FN. Let us require that FP is less than or equal to FPmax, and likewise FN less than or equal to FNmax. Furthermore, put h_{1}(x) = min{h(x), FP_{ max }}. The goodness criterion is then formulated in terms of the distance of points on the curve h_{1} to the origin, which in mathematical symbols may be put as
Given a value on S_{0} a set of (FP(α), FN(α)) will come out, and from this a spline h is given. Finally, calculate the criterion (2). Repeat the procedure for a number of S_{0} and α values and choose the combination with the smallest distance.
If one has an assessment regarding the relative importance of FP and FN, that may be reflected in a version of the criterion (2) that incorporates weights.
Other goodness criteria are possible such as the sum of FP and FN or the area under the curve in Figure 2. For more details and other approaches see e.g. [8,9].
Estimating FP
Using the permutation method to simulate the null distribution of no change we can obtain a pvalue for a twosided test, as detailed below.
The data matrix has genes in rows and arrays in columns. Let d(j)^{*k} be the value of the jth gene statistic in the kth permutation of columns and the pvalue for gene i equals
where M is the number of genes, d(i) the observed statistic for gene i, and B the number of permutations [2,10,11].
Thus given the significance level α the genes considered as differentially expressed will have the proportion given by
, where '#' denotes the cardinality of the set.
According to [12] FDR ≤ α/p(α); equality is assumed in their treatment. Furthermore, there is a bound by Benjamini and Hochberg [13] that states that FDR ≤ M × P_{(i)}/i, where P_{(i)} is the largest ordered pvalue which is declared significant. In terms of the entities above we have i = p(α) × M and P_{(i)} = max_{ i } {P_{ i }:P_{ i } ≤ α}. Since P_{(i)} is the largest pvalue called significant, we have P_{(i)} ≤ α. In [14] the estimate FDR = p_{0} × P_{(i)}/p(α) is derived, where p_{0} is the proportion unchanged genes. This suggests that the bound
could be used as a reasonable approximation of FDR.
The current version of samroc uses the estimate
where q_{ X } is the X% fractile of the d^{*}, cf. [3]. This estimate makes use of the fact that the genes whose test statistics fall in the quartile range will be predominantly the unchanged ones. More material on this matter follows in the Appendix.
The proportion false positive equals the proportion of genes called significant times the false discovery rate, or in symbols:
FP = p_{22} = p(α) × FDR.
Although not perfect the estimate of FP will tend to follow the change in true FP quite well, see Table 2.
Estimating FN
One way to attack the problem of estimating FN would be the following.
By definition the proportion false negative consists of all genes minus the true negative and the positive.
Using the notation in Table 1FN = p_{ 21 }. From this, we may proceed to p_{ 0 } = p_{ 11 } + p_{ 22 }. Furthermore, p(α) = p_{ 12 } + p_{ 22 } and p = FP_{ 22 }. Thus p_{11} = p_{0}  p_{22}. Finally p_{ 21 } is identified from p_{ 21 } = 1  p_{ 11 }  p_{ 12 }  p_{ 22 }.
Using (5) one obtains
The graph we want to study is the one of the estimate
versus the estimate of
FP = p(α) × FDR
The Samroc algorithm
The statistic (1) calculated in the following way will be referred to as samroc, and comes with SAG 0.913.
Before the algorithm starts the S's are smoothed as a function of the average expression level of the gene, if the option smooth = T. By default smooth = F.
The algorithm suggested can be summarised by the following steps

1.
Calculate the effect estimate, e.g. difference between group means

2.
Calculate the standard errors

3.
Generate effect estimates under the null hypothesis, through B permutations (by default B = 100).

4.
Calculate standard errors for these simulation estimates

5.
Calculate the test statistic (1) for a given value of S_{0} for all previous steps, 1, 2, 3 and 4.

6.
Iterate over new values of S_{0}, i.e. a number of fractiles of S, and a number of α 's to find an optimum.
Samroc also outputs a suggested significance level as well as an optimal S_{0}, that is estimated to minimise the criterion (1). Furthermore, the function outputs the observed test statistic, the simulation test statistic and the uncorrected pvalues. Thus, one may use this output to calculate corrected pvalues, e.g. by using the package multtest [15] to obtain the multiple test procedure detailed in [10] (see also [16]), or samfdr in SAG to obtain the decisions based on the FDR [2]. Additionally, the R base package offers the function p. adjust, which provides the multiple test options Bonferroni, Holm and Hochberg.
Since the pvalues they produce are monotone in the uncorrected ones they gives genes the same rank order as the uncorrected pvalues. One shall bear in mind that the multiple test procedures tend to be very strict, and thus provide low power for detecting changed genes. New and promising ideas based on the FDR concerning how to increase power exist, e.g. [14]. The subject is important and complex, and would easily fill up an article this size.
A modified algorithm which starts by fixing the number of genes to be selected will be evaluated in the future.
Results
When testing methods in this field it is difficult to find suitable data where something is known about true status of the genes. If one chooses to simulate, then the distributions may not be entirely representative of a real life situation. If can find nonproprietary real life data, then the knowledge as to which genes are truly changed may be uncertain. The bulk of my experience with this method comes from analysing proprietary data from experiments run on the Affymetrix U95 GeneChips. I have had very good results, but my telling so will not convince the critical mind, so I have tried to find both relevant simulation models and relevant publicly available real life data.
It has been common to simulate data, as in [4,1,14]. All these use normal distributions, in the case of [1] conditional normal distributions. The data used in this article were simulated using mixtures of the distributions described in [4]. These theoretical distributions were modelled after the real E. coli cDNA data presented in [17]. To make the situation a little more realistic the distributions of the changed genes are now chosen randomly among the three last bottom rows of Table 3. The null distribution was obtained by mixing the distributions in the first three rows of Table 3. The simulation program, including the seed, is available at the supplementary web site [7].
Ten thousand genes were simulated representing two groups of either different or identical distributions. The groups were of size four. Here we compare the number of false and true positives among the top 500 ranked genes. Choosing the number 500 is of course arbitrary, but resembles the real life situation were we have set aside resources to follow up on a certain number of genes, and looking at 5% is not unreasonable. The number one would follow up is smaller, but may vary depending on what molecular classes are represented. Choosing a much larger number than 500 will make the performance of the tests more equal and the comparisons less relevant, and choosing a much smaller will make comparisons between methods very uncertain.
In the comparison samroc, ttest, Wilcoxon, MG, Fold Change, the Bayesian method in [1], and SAM [2] were competing. By the ttest we mean the unequal variance ttest: for sample means mean_{1} and mean_{2} and sample variances S_{1}^{2}, S_{2}^{2}. The Wilcoxon rank sum test is based on the sum W_{ s } of the ranks of the observations in one of the groups W_{ s } = R_{1}... + R_{n1} [18]. The Bayesian method calculates the posterior odds for genes being changed (available as functions stat.bay.est in the R package SAG, and stat.bayesian in sma ([1],[5]). Finally, Fold Change equals FC = max{mean_{1}, mean_{2}}/min{mean_{1}, mean_{2}}.
The ttest is known to have an important optimality feature called Uniformly Most Powerful unbiased when data are normal [19]. Basically, this means that the ttest is hard to beat when data are normal. However, when the number of observations is low the estimate of the standard error (SE) which goes into the denominator can sometimes play tricks. When the SE is low there is an increased risk of obtaining false positives as has been noted by several authors. This problem predominantly appears at low expression levels.
Let us start by the simulated data that is expected to contain 5% changed genes, see Table 4. When data are truly normal and there are as many as 4 observations per group samroc performs best, followed by SAM together with Wilcoxon, and the MG method last, being a total disaster. The program estimates p_{0} at 96%, the regularising constant S_{0} is set to 0, and the suggested cutoff is 0.02.
When the same data is exponentiated, thus giving lognormal distributions, MG comes out first, followed by samroc, the ttest and Wilcoxon lag far behind, and Fold Change is clearly a failure. This time p_{0} is estimated at 96%, S_{0} is taken to be the minimum S, and the suggested cutoff is 0.02.
When data are normal and the expected proportion of changed genes equals 10%, p_{0} is estimated at 91%, 0 is chosen as the value of S_{0}, the suggested cutoff is 0.03. Now samroc comes out on top, followed by SAM and the ttest, and again MG cannot cope with normal data, see Table 5.
These data were also exponentiated and p_{0} was then estimated at 91%, and S_{0} was chosen to be the minimum of the S's, while the cutoff was set to 0.03. The MG method came out a clear winner, followed by samroc and Bayes, and Fold Change trailing behind.
Next let us look at the leukaemia data from [20] which consists of 38 samples run on the Affymetrix Hu6800FL oligonucleotide chip. The Average Difference values on 7129 probe sets were downloaded from [21]. The samples either belong to the acute myeloid leukaemia (ALL) or the acute lymphoblastic leukaemia (AML) category, with 27 replicates of the first category and 11 of the second. A review of how three methods fare with these data is presented in [22]. In that reference 50 genes are listed that based on statistical analysis of the full set of 38 samples and on biological evidence are believed to be differentially expressed when comparing ALL to AML. With the full sample the results agree well between methods, and there is reason to believe that the genes are truly differentially expressed. With a large sample size, the choice of method is not so critical as with a small. However, a good method would pick out these genes already at a smaller sample size. Therefore, it is reasonable to score the methods by the average rank of the genes on the list.
The data are preprocessed by subtracting the median and dividing by its quartile range as in [22]. Other, possibly more efficient alternatives exist, especially if the intensities are available, see e.g. [23]. But this normalisation is sufficient for the current treatise, where the relative merits of the methods for ranking genes is the issue.
In Table 6 we can see the ranks of the genes among all genes. In the first two columns we see the ttest and SAM, and it is evident that they agree quite well on most genes, which is not so surprising, since SAM is a slightly modified ttest. Looking at the overall average ranks in the bottom row reveals that samroc tends to push the differentially expressed genes higher up the list than the other methods. Though its apparent similarity to SAM, samroc has a different behaviour. The Bayes method may have some problems caused by the assumption of conditional normality at this sample size with these data.
Discussion
Whether to look at data on a log scale or not is a tricky question, and is beyond the scope of this article. However, from the above it is appears that the best performance by the tests considered is achieved when data are lognormal. But the methods are tested on normal, lognormal and real life data, in order to supply a varied testing ground.
The proposed method comes out better than the original SAM statistic in every test performed. Obviously, the ordinary Fold Change is a disaster, as has been noted by several authors before. The success of MG is rather unexpected and hard to understand, and on top of that the statistic corresponds to a very general hypothesis. But the fact remains that it is a tough contender when data are close to lognormal, which is often the case. In contrast to samroc, however, it suffers from being highly sensitive to distributional assumptions. Maybe a calibration of a using the algorithm outlined in this article can further extend its use. The samroc statistic d is robust and flexible in that it can address all sorts of problems that suit a linear model. The methodology adjusts the regularising constant when data are nonnormal and achieves an improved performance. The algorithm ranks the genes in a reliable fashion, and also gives some rough idea of how many genes it makes sense to look closer at.
A typical run with real life data will take several hours on a desktop computer. To make this methodology better suited for production it would be a good investment to translate part of the R code, or the whole of it, into C.
In order to improve on standard univariate tests one must make use of the fact that data are available on a large number of related tests. In this article it has been shown one way of achieving this goal. The conclusion is that it is possible and sensible to calibrate the test with respect to estimates of the false positive and false negative rates.
Appendix
Estimation of p_{0}
Using the model in [12] where the distribution is assumed to be a mixture of differentially expressed genes (denoted 1) and rest (denoted 0), such that in terms of densities f(x) = p_{1}f_{1}(x) + (1  p_{1}) f_{0}(x), the entries of Table 1 may be estimated given a value on p_{1}. Let p_{0} = 1  p_{1}, which is not identifiable without strong parametric assumptions. There exist however a number of suggested solutions. From the relation (3.9) in [12] p_{1} ≥ 1  min{f_{0}/f}. Similarly one may show that p_{1} ≤ f(Z) / f_{1}(Z) = p_{1}^{*}/(1  p_{0}^{*}f_{0}(Z)/f(Z)), taking p_{0}^{*} and p_{1}^{*} from the previous relation. The right members of the two previous relations may be estimated by a bootstrap explained in the reference. This procedure basically uses the data and the bootstrapped data to estimate the log odds for having an observation of one type versus the other in an interval and smoothes the logodds as a function of the test statistic with a spline. A third way to estimate p_{0} is to use the assumption that the expected difference under the null hypothesis is zero E_{0}[d] = 0, and a variance decomposition. From E [d] = μ = p_{1} μ_{1} and the variance decomposition σ^{2} = Var [d] = p_{1}(μ_{1}  μ)^{2} + p_{0} μ^{2} + p_{1} σ_{1}^{2} + p_{0} σ_{0}^{2}, we have, disregarding p_{1}σ_{1}^{2} and using μ_{1} = μ/p_{1}, the inequality σ^{2} ≥ p_{0}{μ^{2} (p_{0}/p_{1} + 1) + σ_{0}^{2}}. The moments are estimated from the sample and the bootstrap, and by assuming equality, and thereby inflating p_{0} a bit, we can solve for p_{0}. Assuming equality in the first of the two previous inequalities, using the result in the second and assuming equality also there, and taking the mean of the three estimates,
In a short series of simulation tests the estimate (3) performed best. But other options will be considered in preparation for the next release of samroc.
Why it is not enough to control FDR
If one agrees that, in the notation of Table 1, it is desirable to minimise the proportion of incorrectly called genes p_{12} + p_{21} (or in general some increasing function of these), then it will not suffice to minimise FDR. The FDR ≈ p_{0} × α/p(α), with α the significance level, tends to decrease with decreasing α, but at the same time the power will decrease as well [18]. The real issue then becomes to keep the FDR low at the same time as achieving a reasonable power. In the context of microarrays, however, it is not possible to calculate power since that requires a specification of the alternative hypothesis, which is not practical. Instead of trying the impossible it is better to balance FDR and power by using the information on p_{21} and p_{12} that we have through FP and FN.
References
Lonnstedt I, Speed TP: Replicated microarray data. Statistica Sinica. 2002, 12: 3146.
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98: 51165121. 10.1073/pnas.091062498.
Chu G, Narasimhan B, Tibshirani R, Tusher VG: SAM Version 1.12 "Significance Analysis of Microarrays" User's Guide and technical document,. 2001
Baldi P, Long AD: A Bayesian Framework for the Analysis of Microarray Expression Data: Regularized tTest and Statistical Inferences of Gene Changes. Bioinformatics. 2001, 17: 509519. 10.1093/bioinformatics/17.6.509.
The R project. [htttp://www.cran.rproject.org]
Ihaka R, Gentleman R: R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics. 1996, 5: 299314.
Supplementary web page. [http://home.swipnet.se/pibroberg]
Lovell DR, Dance CR, Niranjan M, Prager RW, Dalton KJ: Ranking the effect of different features on the classification of discrete valued data. in Engineering Applications of Neural Networks (Kingston on Thames, London). 1996, 487494.
Genovese C, Wasserman L: Operating Characteristics of the FDR procedure. technical report Carnegie Mellon University. 2001
Dudoit S, Yang YH, Speed TP, Callow MJ: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments.,. Statistica Sinica. 2002, 12: 111140.
Davison AC, Hinkley DV: Bootstrap Methods and their Application. 1997, Cambridge : Cambridge Univeristy Press
Efron B, Tibshirani R, Storey JD, Tusher VG: Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001, 96: 11511160. 10.1198/016214501753382129.
Benjamini Y, Hochberg Y: Controlling the false discovery rate : a practical and powerful approach to multiple testing. JR Statist Soc B. 1995, 57: 963971.
Storey JD: A Direct Approach to False Discovery Rates. technical report Stanford. 2001
Bioconductor software for bioinformatics. [http://www.bioconductor.org]
Callow MJ, Dudoit S, Gong EL, Speed TP, Rubin EM: Microarray expression profiling identifies genes with altered expression in hdldeficient mice. Genome Res. 2000, 10 (12): 20229. 10.1101/gr.10.12.2022.
Arfin SM, Long AD, Ito T, Tolleri L, Riehle MM, Paegle ES, Hatfield GW: Global gene expression profiling in escherichia coli K12: the effect of integration host factor. J Biol Chem. 2000, 275: 2967229684. 10.1074/jbc.M002247200.
Lehmann EL: Nonparametrics : Statistical Methods based on Ranks. 1975, San Francisco, HoldenDay
Lehmann EL: Testing Statistical Hypothesis. 1959, New York: Wiley
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al: Molecular classification of cancer:class discovery and class prediction by gene expression monitoring. Science. 1999, 285: 531537. 10.1126/science.286.5439.531.
Dataset used in [20]. [http://wwwgenome.wi.mit.edu/cgibin/cancer/datasets.cgi]
Pan W: A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics. 2002, 18: 456554. 10.1093/bioinformatics/18.4.546.
Irizarry RA, Hobbs B, Speed TP: Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Manuscript in preparation. 2001, [http://www.stat.berkeley.edu/users/terry/zarray/Affy/GL_Workshop/genelogic2001.html]
Acknowledgement
Ingrid Lonnstedt generously made the code to a linear models version of stat.bay.est available. Furthermore, the comments from Terry Speed are gratefully acknowledged.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Broberg, P. Ranking genes with respect to differential expression. Genome Biol 3, preprint0007.1 (2002). https://doi.org/10.1186/gb200239preprint0007
Received:
Published:
DOI: https://doi.org/10.1186/gb200239preprint0007
Keywords
 False Discovery Rate
 False Negative Rate
 Receiver Operating Characteristic Curve
 Real Life Data
 Multiple Test Procedure