Ranking genes with respect to differential expression

Broberg, Per

doi:10.1186/gb-2002-3-9-preprint0007

Deposited research article
Published: 05 August 2002

Ranking genes with respect to differential expression

Per Broberg¹

Genome Biology volume 3, Article number: preprint0007.1 (2002) Cite this article

9977 Accesses
10 Citations
Metrics details

Abstract

Background

In the pharmaceutical industry and in academia substantial efforts are made to make the best use of the promising microarray technology. The data generated by microarrays are more complex than most other biological data attracting much attention at this point. A method for finding an optimal test statistic with which to rank genes with respect to differential expression is outlined and tested. At the heart of the method lies an estimate of the false negative and false positive rates. Both investing in false positives and missing true positives lead to a waste of resources. The procedure sets out to minimise these errors. For calculation of the false positive and negative rates a simulation procedure is invoked.

Results

The method outperforms commonly used alternatives when applied to simulated data modelled after real cDNA array data as well as when applied to real oligonucleotide array data. In both cases the method comes out as the over-all winner. The simulated data are analysed both exponentiated and on the original scale, thus providing evidence of the ability to cope with normal and lognormal distributions. In the case of the real life data it is shown that the proposed method will tend to push the differentially expressed genes higher up on a test statistic based ranking list than the competitors.

Conclusions

The approach of making use of information concerning both the false positive and false negative rates in the inference adds a useful tool to the toolbox available to scientists in functional genomics.

Background

The microarray technology has revolutionized modern biological research by permitting the simultaneous study of a great part of the genome. The blessings stemming from this also brings the curse of high dimensionality of the data output. Microarrays play an important role in finding drug targets. This application provides the primary practical motivation for the method presented.

The main objective of this article is to explore one method for ranking genes in order of likelihood of being differentially expressed. Since it is the ranking that is of main interest, issues such as calculation of p-values and correction for multiple tests play a secondary role. Rather it is the thinking expressed in [1] that guide much drug target identification research: "The number of genes selected would depend on the size, aim, background and follow-up plans of the experiment". Often interest is restricted to some so-called drugable class of targets, thus thinning out the set eligible genes considerably. Furthermore, the fact that a group of genes in the same functional class show evidence of differential expression may be much more telling than the individual p-values as such. However the rank order does play a role: It is generally sensible to validate a target in the drugable class with a smaller p-value prior to proceeding to one with a larger p-value. Sometimes it is possible to be guided by the performance of known drug targets in the choice of cut-off, but at any rate p-values have the greatest impact on decisions by providing a preliminary ranking of the genes. This is not to say that one should never take multiplicity into account or that this method in some way replaces correction for multiplicity. On the contrary samroc provides the basis for such calculations, see Section Samroc.

The approach presented could be applied to different types of test statistics, but to fix ideas one particular type recently proposed will be used. In the [2] a methodology based on a regularised t-statistic is described:

where diff is an effect estimate, e.g. a group mean difference, S is a standard error, and S₀ is the regularising constant. This formulation is quite general and e.g. includes the estimation of a contrast in an ANOVA. Putting S₀ = 0 will yield a t-statistic. The constant is found by removing the trend in d as a function of S in moving windows across the data. The technical details are spelled out in [3]. The statistic calculated this way will be referred to as SAM.

The basic idea with d is to eliminate some false positives with low values on S. It seems more relevant to optimise with respect to what is really the issue, namely the false positive and false negative rates. This is the intuition behind the approach.

An alternative to the statistic (1) is

or for some weight w, which is basically the statistic proposed in [4]. Its performance appears to be very similar to that of (1) (data not shown). A more imaginative, but rather unorthodox, approach to comparing two groups would be to use m = Σe^-aXi/n, where X_i is assumed positive, for a fold change calculation of the measures m₁ and m₂ for the two groups: MG = max{m₁, m₂}/min{m₁, m₂}. The statistic m estimates the moment generating function. The case a = 1 will be evaluated.

In this article another goodness criterion is proposed and argued. What is really relevant in choosing a good method is choosing one with an attractive Receiver Operating Characteristics (ROC) curve. By an attractive ROC curve is meant a plot of Proportion False Positives against Proportion False Negative that is as close to the axes as possible. This will minimise the number of genes that are falsely declared positive and falsely declared negative for a given significance level α and value on S₀. It will also minimise the distance to the origin, which will be the criterion for finding an optimal value on S₀ as well as on α, see Fig. 2.

A software implementation in R code [5,6] is available at the supplementary web page [7]. The R package SAG contains the function samroc, which provides an implementation of the method.

The structure of the article is the following. First the criterion is explained in detail. Then the estimation problems concerning the false positive and false negative rates are solved. Finally, the algorithm is outlined and tested on some simulated and some real data.

Methods

The criterion

A comparison of methods in terms of their ROC curves is displayed in Figure 4 of [1]. There a method whose ROC curve lies below another one is preferred, see Figure 2. If we agree that it be sensible to compare methods with respect to their ROC curves, then estimation procedures ought to find parameter estimates that make the ROC curve optimal in some sense. This section suggests a goodness criterion for the ROC curve.

By False Discovery Rate (FDR) we mean the proportion of false positives among the significant genes, seer e.g. [2]. Multiplying FDR by the proportion of genes that are declared significant and dividing by the number of genes we obtain the False Positive rate (FP). Similarly we define the False Negative rate (FN).

Assume that we can, for given significance level α, estimate FP(α) and FN(α), and let h be a spline approximating the regression of FP on FN. Let us require that FP is less than or equal to FPmax, and likewise FN less than or equal to FNmax. Furthermore, put h₁(x) = min{h(x), FP_max}. The goodness criterion is then formulated in terms of the distance of points on the curve h₁ to the origin, which in mathematical symbols may be put as

Given a value on S₀ a set of (FP(α), FN(α)) will come out, and from this a spline h is given. Finally, calculate the criterion (2). Repeat the procedure for a number of S₀ and α values and choose the combination with the smallest distance.

If one has an assessment regarding the relative importance of FP and FN, that may be reflected in a version of the criterion (2) that incorporates weights.

Other goodness criteria are possible such as the sum of FP and FN or the area under the curve in Figure 2. For more details and other approaches see e.g. [8,9].

Estimating FP

Using the permutation method to simulate the null distribution of no change we can obtain a p-value for a two-sided test, as detailed below.

The data matrix has genes in rows and arrays in columns. Let d(j)^*k be the value of the jth gene statistic in the kth permutation of columns and the p-value for gene i equals

where M is the number of genes, d(i) the observed statistic for gene i, and B the number of permutations [2,10,11].

Thus given the significance level α the genes considered as differentially expressed will have the proportion given by

, where '#' denotes the cardinality of the set.

According to [12] FDR ≤ α/p(α); equality is assumed in their treatment. Furthermore, there is a bound by Benjamini and Hochberg [13] that states that FDR ≤ M × P_(i)/i, where P_(i) is the largest ordered p-value which is declared significant. In terms of the entities above we have i = p(α) × M and P_(i) = max_i {P_i:P_i ≤ α}. Since P_(i) is the largest p-value called significant, we have P_(i) ≤ α. In [14] the estimate FDR = p₀ × P_(i)/p(α) is derived, where p₀ is the proportion unchanged genes. This suggests that the bound

could be used as a reasonable approximation of FDR.

The current version of samroc uses the estimate

where q_X is the X% fractile of the d^*, cf. [3]. This estimate makes use of the fact that the genes whose test statistics fall in the quartile range will be predominantly the unchanged ones. More material on this matter follows in the Appendix.

The proportion false positive equals the proportion of genes called significant times the false discovery rate, or in symbols:

FP = p₂₂ = p(α) × FDR.

Although not perfect the estimate of FP will tend to follow the change in true FP quite well, see Table 2.

Table 2 The actual and estimated FP's and FN's for the simulated data in the Results section. Often FP is quite well estimated, while the estimate FN^{^} seems to have a bias, but nevertheless captures the changes that takes place.

Full size table

Estimating FN

One way to attack the problem of estimating FN would be the following.

By definition the proportion false negative consists of all genes minus the true negative and the positive.

Using the notation in Table 1FN = p₂₁. From this, we may proceed to p₀ = p₁₁ + p₂₂. Furthermore, p(α) = p₁₂ + p₂₂ and p = FP₂₂. Thus p₁₁ = p₀ - p₂₂. Finally p₂₁ is identified from p₂₁ = 1 - p₁₁ - p₁₂ - p₂₂.

Table 1 The unknown distribution of true and false positives and negatives. The proportion of incorrectly called genes equals p₂₁ + p₂₂.

Full size table

Using (5) one obtains

The graph we want to study is the one of the estimate

versus the estimate of

FP = p(α) × FDR

The Samroc algorithm

The statistic (1) calculated in the following way will be referred to as samroc, and comes with SAG 0.9-13.

Before the algorithm starts the S's are smoothed as a function of the average expression level of the gene, if the option smooth = T. By default smooth = F.

The algorithm suggested can be summarised by the following steps

1.
Calculate the effect estimate, e.g. difference between group means
2.
Calculate the standard errors
3.
Generate effect estimates under the null hypothesis, through B permutations (by default B = 100).
4.
Calculate standard errors for these simulation estimates
5.
Calculate the test statistic (1) for a given value of S₀ for all previous steps, 1, 2, 3 and 4.
6.
Iterate over new values of S₀, i.e. a number of fractiles of S, and a number of α 's to find an optimum.

Samroc also outputs a suggested significance level as well as an optimal S₀, that is estimated to minimise the criterion (1). Furthermore, the function outputs the observed test statistic, the simulation test statistic and the uncorrected p-values. Thus, one may use this output to calculate corrected p-values, e.g. by using the package multtest [15] to obtain the multiple test procedure detailed in [10] (see also [16]), or samfdr in SAG to obtain the decisions based on the FDR [2]. Additionally, the R base package offers the function p. adjust, which provides the multiple test options Bonferroni, Holm and Hochberg.

Since the p-values they produce are monotone in the uncorrected ones they gives genes the same rank order as the uncorrected p-values. One shall bear in mind that the multiple test procedures tend to be very strict, and thus provide low power for detecting changed genes. New and promising ideas based on the FDR concerning how to increase power exist, e.g. [14]. The subject is important and complex, and would easily fill up an article this size.

A modified algorithm which starts by fixing the number of genes to be selected will be evaluated in the future.

Results

When testing methods in this field it is difficult to find suitable data where something is known about true status of the genes. If one chooses to simulate, then the distributions may not be entirely representative of a real life situation. If can find non-proprietary real life data, then the knowledge as to which genes are truly changed may be uncertain. The bulk of my experience with this method comes from analysing proprietary data from experiments run on the Affymetrix U95 GeneChips. I have had very good results, but my telling so will not convince the critical mind, so I have tried to find both relevant simulation models and relevant publicly available real life data.

It has been common to simulate data, as in [4,1,14]. All these use normal distributions, in the case of [1] conditional normal distributions. The data used in this article were simulated using mixtures of the distributions described in [4]. These theoretical distributions were modelled after the real E. coli cDNA data presented in [17]. To make the situation a little more realistic the distributions of the changed genes are now chosen randomly among the three last bottom rows of Table 3. The null distribution was obtained by mixing the distributions in the first three rows of Table 3. The simulation program, including the seed, is available at the supplementary web site [7].

Table 3 The normal distributions simulated, defined by their means and standard deviations. The first three rows do not represent differential expression, while the last three rows do.

Full size table

Ten thousand genes were simulated representing two groups of either different or identical distributions. The groups were of size four. Here we compare the number of false and true positives among the top 500 ranked genes. Choosing the number 500 is of course arbitrary, but resembles the real life situation were we have set aside resources to follow up on a certain number of genes, and looking at 5% is not unreasonable. The number one would follow up is smaller, but may vary depending on what molecular classes are represented. Choosing a much larger number than 500 will make the performance of the tests more equal and the comparisons less relevant, and choosing a much smaller will make comparisons between methods very uncertain.

In the comparison samroc, t-test, Wilcoxon, MG, Fold Change, the Bayesian method in [1], and SAM [2] were competing. By the t-test we mean the unequal variance t-test: for sample means mean₁ and mean₂ and sample variances S₁², S₂². The Wilcoxon rank sum test is based on the sum W_s of the ranks of the observations in one of the groups W_s = R₁... + R_n1 [18]. The Bayesian method calculates the posterior odds for genes being changed (available as functions stat.bay.est in the R package SAG, and stat.bayesian in sma ([1],[5]). Finally, Fold Change equals FC = max{mean₁, mean₂}/min{mean₁, mean₂}.

The t-test is known to have an important optimality feature called Uniformly Most Powerful unbiased when data are normal [19]. Basically, this means that the t-test is hard to beat when data are normal. However, when the number of observations is low the estimate of the standard error (SE) which goes into the denominator can sometimes play tricks. When the SE is low there is an increased risk of obtaining false positives as has been noted by several authors. This problem predominantly appears at low expression levels.

Let us start by the simulated data that is expected to contain 5% changed genes, see Table 4. When data are truly normal and there are as many as 4 observations per group samroc performs best, followed by SAM together with Wilcoxon, and the MG method last, being a total disaster. The program estimates p₀ at 96%, the regularising constant S₀ is set to 0, and the suggested cut-off is 0.02.

Table 4 Genes were ranked with the statistical methods in terms of degree of differential expression, and the number of true and false positives among the top 500 was counted. Above are the number of true and false positives in the top 500 when the proportion unchanged equals 5% and the number of genes equals 10000.

Full size table

When the same data is exponentiated, thus giving lognormal distributions, MG comes out first, followed by samroc, the t-test and Wilcoxon lag far behind, and Fold Change is clearly a failure. This time p₀ is estimated at 96%, S₀ is taken to be the minimum S, and the suggested cut-off is 0.02.

When data are normal and the expected proportion of changed genes equals 10%, p₀ is estimated at 91%, 0 is chosen as the value of S₀, the suggested cut-off is 0.03. Now samroc comes out on top, followed by SAM and the t-test, and again MG cannot cope with normal data, see Table 5.

Table 5 False postives and true positives in top 500, when the proportion unchanged equals 10% and the number of genes equals 10000.

Full size table

These data were also exponentiated and p₀ was then estimated at 91%, and S₀ was chosen to be the minimum of the S's, while the cut-off was set to 0.03. The MG method came out a clear winner, followed by samroc and Bayes, and Fold Change trailing behind.

Next let us look at the leukaemia data from [20] which consists of 38 samples run on the Affymetrix Hu6800FL oligonucleotide chip. The Average Difference values on 7129 probe sets were downloaded from [21]. The samples either belong to the acute myeloid leukaemia (ALL) or the acute lymphoblastic leukaemia (AML) category, with 27 replicates of the first category and 11 of the second. A review of how three methods fare with these data is presented in [22]. In that reference 50 genes are listed that based on statistical analysis of the full set of 38 samples and on biological evidence are believed to be differentially expressed when comparing ALL to AML. With the full sample the results agree well between methods, and there is reason to believe that the genes are truly differentially expressed. With a large sample size, the choice of method is not so critical as with a small. However, a good method would pick out these genes already at a smaller sample size. Therefore, it is reasonable to score the methods by the average rank of the genes on the list.

The data are pre-processed by subtracting the median and dividing by its quartile range as in [22]. Other, possibly more efficient alternatives exist, especially if the intensities are available, see e.g. [23]. But this normalisation is sufficient for the current treatise, where the relative merits of the methods for ranking genes is the issue.

In Table 6 we can see the ranks of the genes among all genes. In the first two columns we see the t-test and SAM, and it is evident that they agree quite well on most genes, which is not so surprising, since SAM is a slightly modified t-test. Looking at the over-all average ranks in the bottom row reveals that samroc tends to push the differentially expressed genes higher up the list than the other methods. Though its apparent similarity to SAM, samroc has a different behaviour. The Bayes method may have some problems caused by the assumption of conditional normality at this sample size with these data.

Table 6 Results for Leukemia data using only the first four samples from ALL and AML. For the full results see the supplementary web page.

Full size table

Discussion

Whether to look at data on a log scale or not is a tricky question, and is beyond the scope of this article. However, from the above it is appears that the best performance by the tests considered is achieved when data are lognormal. But the methods are tested on normal, lognormal and real life data, in order to supply a varied testing ground.

The proposed method comes out better than the original SAM statistic in every test performed. Obviously, the ordinary Fold Change is a disaster, as has been noted by several authors before. The success of MG is rather unexpected and hard to understand, and on top of that the statistic corresponds to a very general hypothesis. But the fact remains that it is a tough contender when data are close to lognormal, which is often the case. In contrast to samroc, however, it suffers from being highly sensitive to distributional assumptions. Maybe a calibration of a using the algorithm outlined in this article can further extend its use. The samroc statistic d is robust and flexible in that it can address all sorts of problems that suit a linear model. The methodology adjusts the regularising constant when data are non-normal and achieves an improved performance. The algorithm ranks the genes in a reliable fashion, and also gives some rough idea of how many genes it makes sense to look closer at.

A typical run with real life data will take several hours on a desktop computer. To make this methodology better suited for production it would be a good investment to translate part of the R code, or the whole of it, into C.

In order to improve on standard univariate tests one must make use of the fact that data are available on a large number of related tests. In this article it has been shown one way of achieving this goal. The conclusion is that it is possible and sensible to calibrate the test with respect to estimates of the false positive and false negative rates.

Appendix

Estimation of p₀

Using the model in [12] where the distribution is assumed to be a mixture of differentially expressed genes (denoted 1) and rest (denoted 0), such that in terms of densities f(x) = p₁f₁(x) + (1 - p₁) f₀(x), the entries of Table 1 may be estimated given a value on p₁. Let p₀ = 1 - p₁, which is not identifiable without strong parametric assumptions. There exist however a number of suggested solutions. From the relation (3.9) in [12] p₁ ≥ 1 - min{f₀/f}. Similarly one may show that p₁ ≤ f(Z) / f₁(Z) = p₁^*/(1 - p₀^*f₀(Z)/f(Z)), taking p₀^* and p₁^* from the previous relation. The right members of the two previous relations may be estimated by a bootstrap explained in the reference. This procedure basically uses the data and the bootstrapped data to estimate the log odds for having an observation of one type versus the other in an interval and smoothes the log-odds as a function of the test statistic with a spline. A third way to estimate p₀ is to use the assumption that the expected difference under the null hypothesis is zero E₀[d] = 0, and a variance decomposition. From E [d] = μ = p₁ μ₁ and the variance decomposition σ² = Var [d] = p₁(μ₁ - μ)² + p₀ μ² + p₁ σ₁² + p₀ σ₀², we have, disregarding p₁σ₁² and using μ₁ = μ/p₁, the inequality σ² ≥ p₀{μ² (p₀/p₁ + 1) + σ₀²}. The moments are estimated from the sample and the bootstrap, and by assuming equality, and thereby inflating p₀ a bit, we can solve for p₀. Assuming equality in the first of the two previous inequalities, using the result in the second and assuming equality also there, and taking the mean of the three estimates,

In a short series of simulation tests the estimate (3) performed best. But other options will be considered in preparation for the next release of samroc.

Why it is not enough to control FDR

If one agrees that, in the notation of Table 1, it is desirable to minimise the proportion of incorrectly called genes p₁₂ + p₂₁ (or in general some increasing function of these), then it will not suffice to minimise FDR. The FDR ≈ p₀ × α/p(α), with α the significance level, tends to decrease with decreasing α, but at the same time the power will decrease as well [18]. The real issue then becomes to keep the FDR low at the same time as achieving a reasonable power. In the context of microarrays, however, it is not possible to calculate power since that requires a specification of the alternative hypothesis, which is not practical. Instead of trying the impossible it is better to balance FDR and power by using the information on p₂₁ and p₁₂ that we have through FP and FN.

References

Lonnstedt I, Speed TP: Replicated microarray data. Statistica Sinica. 2002, 12: 31-46.
Google Scholar
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001, 98: 5116-5121. 10.1073/pnas.091062498.
Article PubMed CAS PubMed Central Google Scholar
Chu G, Narasimhan B, Tibshirani R, Tusher VG: SAM Version 1.12 "Significance Analysis of Microarrays" User's Guide and technical document,. 2001
Google Scholar
Baldi P, Long AD: A Bayesian Framework for the Analysis of Microarray Expression Data: Regularized t-Test and Statistical Inferences of Gene Changes. Bioinformatics. 2001, 17: 509-519. 10.1093/bioinformatics/17.6.509.
Article PubMed CAS Google Scholar
The R project. [htttp://www.cran.r-project.org]
Ihaka R, Gentleman R: R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics. 1996, 5: 299-314.
Google Scholar
Supplementary web page. [http://home.swipnet.se/pibroberg]
Lovell DR, Dance CR, Niranjan M, Prager RW, Dalton KJ: Ranking the effect of different features on the classification of discrete valued data. in Engineering Applications of Neural Networks (Kingston on Thames, London). 1996, 487-494.
Google Scholar
Genovese C, Wasserman L: Operating Characteristics of the FDR procedure. technical report Carnegie Mellon University. 2001
Google Scholar
Dudoit S, Yang YH, Speed TP, Callow MJ: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments.,. Statistica Sinica. 2002, 12: 111-140.
Google Scholar
Davison AC, Hinkley DV: Bootstrap Methods and their Application. 1997, Cambridge : Cambridge Univeristy Press
Book Google Scholar
Efron B, Tibshirani R, Storey JD, Tusher VG: Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001, 96: 1151-1160. 10.1198/016214501753382129.
Article Google Scholar
Benjamini Y, Hochberg Y: Controlling the false discovery rate : a practical and powerful approach to multiple testing. JR Statist Soc B. 1995, 57: 963-971.
Google Scholar
Storey JD: A Direct Approach to False Discovery Rates. technical report Stanford. 2001
Google Scholar
Bioconductor software for bioinformatics. [http://www.bioconductor.org]
Callow MJ, Dudoit S, Gong EL, Speed TP, Rubin EM: Microarray expression profiling identifies genes with altered expression in hdl-deficient mice. Genome Res. 2000, 10 (12): 2022-9. 10.1101/gr.10.12.2022.
Article PubMed CAS PubMed Central Google Scholar
Arfin SM, Long AD, Ito T, Tolleri L, Riehle MM, Paegle ES, Hatfield GW: Global gene expression profiling in escherichia coli K12: the effect of integration host factor. J Biol Chem. 2000, 275: 29672-29684. 10.1074/jbc.M002247200.
Article PubMed CAS Google Scholar
Lehmann EL: Nonparametrics : Statistical Methods based on Ranks. 1975, San Francisco, Holden-Day
Google Scholar
Lehmann EL: Testing Statistical Hypothesis. 1959, New York: Wiley
Google Scholar
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al: Molecular classification of cancer:class discovery and class prediction by gene expression monitoring. Science. 1999, 285: 531-537. 10.1126/science.286.5439.531.
Article Google Scholar
Dataset used in [20]. [http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi]
Pan W: A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics. 2002, 18: 456-554. 10.1093/bioinformatics/18.4.546.
Article Google Scholar
Irizarry RA, Hobbs B, Speed TP: Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Manuscript in preparation. 2001, [http://www.stat.berkeley.edu/users/terry/zarray/Affy/GL_Workshop/genelogic2001.html]
Google Scholar

Download references

Acknowledgement

Ingrid Lonnstedt generously made the code to a linear models version of stat.bay.est available. Furthermore, the comments from Terry Speed are gratefully acknowledged.

Author information

Authors and Affiliations

Molecular Sciences, AstraZeneca R&D Lund, S-221 87, Lund, Sweden
Per Broberg

Authors

Per Broberg
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Per Broberg.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Broberg, P. Ranking genes with respect to differential expression. Genome Biol 3, preprint0007.1 (2002). https://doi.org/10.1186/gb-2002-3-9-preprint0007

Download citation

Received: 31 July 2002
Published: 05 August 2002
DOI: https://doi.org/10.1186/gb-2002-3-9-preprint0007

Ranking genes with respect to differential expression