### Model

#### Description

We assume that the number of reads in sample *j* that are assigned to gene *i* can be modeled by a negative binomial (NB) distribution,

{K}_{ij}~\text{NB}({\mu}_{ij},{\sigma}_{ij}^{2}),

(1)

which has two parameters, the mean *μ*_{ij} and the variance {\sigma}_{ij}^{2}. The read counts *K*_{ij} are non-negative integers. The probabilities of the distribution are given in Supplementary Note A. (All Supplementary Notes are in Additional file 1.) The NB distribution is commonly used to model count data when overdispersion is present [12].

In practice, we do not know the parameters *μ*_{ij} and {\sigma}_{ij}^{2}, and we need to estimate them from the data. Typically, the number of replicates is small, and further modelling assumptions need to be made in order to obtain useful estimates. In this paper, we develop a method that is based on the following three assumptions.

First, the mean parameter *μ*_{ij} , that is, the expectation value of the observed counts for gene *i* in sample *j*, is the product of a condition-dependent per-gene value *q*_{i}, _{ρ}(_{j)}(where *ρ*(*j*) is the experimental condition of sample *j*) and a size factor *s*_{j},

{\mu}_{ij}={q}_{i,\rho {(j)}^{S}j}.

(2)

*q*_{i,ρ(j)}is proportional to the expectation value of the true (but unknown) concentration of fragments from gene *i* under condition *ρ*(*j*). The size factor *s*_{j}represents the coverage, or sampling depth, of library *j*, and we will use the term *common scale* for quantities, such as *q*_{i, ρ(j)}, that are adjusted for coverage by dividing by *s*_{j}.

Second, the variance {\sigma}_{ij}^{2} is the sum of a *shot noise term* and a *raw variance term*,

{\sigma}_{ij}^{2}=\underset{\text{shot}\phantom{\rule{0.5em}{0ex}}\phantom{\rule{0.5em}{0ex}}\text{noise}}{\underset{\u23df}{{\mu}_{ij}}}+\underset{\text{raw}\phantom{\rule{0.5em}{0ex}}\phantom{\rule{0.5em}{0ex}}\text{variance}}{\underset{\u23df}{{s}_{j}^{2}{v}_{i,\rho (j)}}}.

(3)

Third, we assume that the per-gene raw variance parameter *v*_{i, ρ}is a smooth function of *q*_{i}, *ρ*,

{v}_{i,\rho (j)}={v}_{\rho}({q}_{i,\rho (j)}).

(4)

This assumption is needed because the number of replicates is typically too low to get a precise estimate of the variance for gene *i* from just the data available for this gene. This assumption allows us to pool the data from genes with similar expression strength for the purpose of variance estimation.

The decomposition of the variance in Equation (3) is motivated by the following hierarchical model: We assume that the actual concentration of fragments from gene *i* in sample *j* is proportional to a random variable *R*_{ij} , such that the rate that fragments from gene *i* are sequenced is *s*_{j}*r*_{ij} . For each gene *i* and all samples *j* of condition *ρ*, the *R*_{ij} are i.i.d. with mean *q*_{iρ} and variance *v*_{iρ} . Thus, the count value *K*_{ij} , conditioned on *R*_{ij} = *r*_{ij} , is Poisson distributed with rate *s*_{j}*r*_{ij} . The marginal distribution of *K*_{ij} - when allowing for variation in *R*_{ij} - has the mean *μ*_{ij} and (according to the law of total variance) the variance given in Equation (3). Furthermore, if the higher moments of *R*_{ij} are modeled according to a gamma distribution, the marginal distribution of *K*_{ij} is NB (see, for example, [12], Section 4.2.2).

#### Fitting

We now describe how the model can be fitted to data. The data are an *n* × *m* table of counts, *k*_{ij} , where *i* = 1,..., *n* indexes the genes, and *j* = 1,..., *m* indexes the samples. The model has three sets of parameters:

(i) *m* size factors *s*_{j} ; the expectation values of all counts from sample *j* are proportional to *s*_{j} .

(ii) for each experimental condition *ρ*, *n* expression strength parameters *q*_{iρ} ; they reflect the expected abundance of fragments from gene *i* under condition *ρ*, that is, expectation values of counts for gene *i* are proportional to *q*_{iρ} .

(iii) The smooth functions *v*_{ρ} : ℝ^{+} →ℝ^{+}; for each condition *ρ*, *v*_{ρ} models the dependence of the raw variance *v*_{iρ} on the expected mean *q*_{iρ} .

The purpose of the size factors *s*_{j} is to render counts from different samples, which may have been sequenced to different depths, comparable. Hence, the ratios (\mathbb{E}*K*_{ij} )/(\mathbb{E}*K*_{ij'} ) of expected counts for the same gene *i* in different samples *j* and *j'* should be equal to the size ratio *s*_{j} /*s*_{j'} if gene *i* is not differentially expressed or samples *j* and *j'* are replicates. The total number of reads, Σ _{i} *k*_{ij} , may seem to be a good measure of sequencing depth and hence a reasonable choice for *s*_{j} . Experience with real data, however, shows this not always to be the case, because a few highly and differentially expressed genes may have strong influence on the total read count, causing the ratio of total read counts not to be a good estimate for the ratio of expected counts.

Hence, to estimate the size factors, we take the median of the ratios of observed counts. Generalizing the procedure just outlined to the case of more than two samples, we use:

{\widehat{s}}_{j}=\underset{i}{\text{median}}\frac{{k}_{ij}}{{\left({\displaystyle {\prod}_{v=1}^{m}{k}_{iv}}\right)}^{1/m}}.

(5)

The denominator of this expression can be interpreted as a pseudo-reference sample obtained by taking the geometric mean across samples. Thus, each size factor estimate {\widehat{s}}_{j} is computed as the median of the ratios of the *j*-th sample's counts to those of the pseudo-reference. (Note: While this manuscript was under review, Robinson and Oshlack [13] suggested a similar method.)

To estimate *q*_{iρ} , we use the average of the counts from the samples *j* corresponding to condition *ρ*, transformed to the common scale:

{\widehat{q}}_{i\rho}=\frac{1}{{m}_{\rho}}{\displaystyle \sum _{j:\rho (j)=\rho}\frac{{k}_{ij}}{{\widehat{s}}_{j}}},

(6)

where *m*_{ρ} is the number of replicates of condition *ρ* and the sum runs over these replicates. the functions *v*_{ρ} , we first calculate sample variances on the common scale

{w}_{i\rho}=\frac{1}{{m}_{\rho}-1}{\displaystyle \sum _{j:\rho (j)=\rho}{\left(\frac{{k}_{ij}}{{\widehat{s}}_{j}}-{\widehat{q}}_{i\rho}\right)}^{2}}

(7)

and define

{z}_{i\rho}=\frac{{\widehat{q}}_{i\rho}}{{m}_{\rho}}{\displaystyle \sum _{j:\rho (j)=\rho}\frac{1}{{\widehat{s}}_{j}}}.

(8)

In Supplementary Note B in Additional file 1 we show that *w*_{iρ} - *z*_{iρ} is an unbiased estimator for the raw variance parameter *v*_{iρ} of Equation (3).

However, for small numbers of replicates, *m*_{ρ} , as is typically the case in applications, the values *w*_{iρ} are highly variable, and *w*_{iρ} - *z*_{iρ} would not be a useful variance estimator for statistical inference. Instead, we use local regression [14] on the graph ({\widehat{q}}_{i\rho},{w}_{i\rho}) to obtain a smooth function *w*_{ρ} (*q*), with

{\widehat{v}}_{\rho}({\widehat{q}}_{i\rho})={w}_{\rho}({\widehat{q}}_{i\rho})-{z}_{i\rho}

(9)

as our estimate for the raw variance.

Some attention is needed to avoid estimation biases in the local regression. *w*_{iρ} is a sum of squared random variables, and the residuals {w}_{i\rho}-w({\widehat{q}}_{i\rho}) are skewed. Following References [15], Chapter 8 and [14], Section 9.1.2, we use a generalized linear model of the gamma family for the local regression, using the implementation in the *locfit* package [16].

### Testing for differential expression

Suppose that we have *m*_{A} replicate samples for biological condition A and *m*_{B} samples for condition B. For each gene *i*, we would like to weigh the evidence in the data for differential expression of that gene between the two conditions. In particular, we would like to test the null hypothesis *q*_{iA} = *q*_{iB} , where *q*_{iA} is the expression strength parameter for the samples of condition A, and q_{iB} for condition B. To this end, we define, as test statistic, the total counts in each condition,

\begin{array}{cc}{K}_{i\text{A}}={\displaystyle \sum _{j:\rho (j)=\text{A}}{K}_{ij}},& {K}_{i\text{B}}={\displaystyle \sum _{j:\rho (j)=\text{B}}{K}_{ij}},\end{array}

(10)

and their overall sum *K*_{iS} = *K*_{iA} + *K*_{iB} . From the error model described in the previous Section, we show below that - under the null hypothesis - we can compute the probabilities of the events *K*_{iA} = *a* and *K*_{iB} = *b* for any pair of numbers *a* and *b*. We denote this probability by *p*(*a*, *b*). The *P* value of a pair of observed count sums (*k*_{iA} , *k*_{iB} ) is then the sum of all probabilities less or equal to *p*(*k*_{iA} , *k*_{iB} ), given that the overall sum is *k*_{iS} :

{p}_{i}=\frac{{\displaystyle \sum _{\begin{array}{c}a+b={k}_{i\text{S}}\\ p(a,b)\le p({k}_{i\text{A}}{k}_{i\text{B}})\end{array}}p(a,b)}}{{\displaystyle \sum _{a+b={k}_{i\text{S}}}p(a,b)}}.

(11)

The variables *a* and *b* in the above sums take the values 0,..., *k*_{iS}. The approach presented so far follows that of Robinson and Smyth [11] and is analogous to that taken by other conditioned tests, such as Fisher's exact test. (See Reference [17], Chapter 3 for a discussion of the merits of conditioning in tests.)

**Computation of** *p*(*a*, *b*). First, assume that, under the null hypothesis, counts from different samples are independent. Then, *p*(*a*, *b*) = Pr(*K*_{iA}= *a*) Pr(*K*_{iB}= *b*). The problem thus is computing the probability of the event *K*_{iA}= *a*, and, analogously, of *K*_{iB}= *b*. The random variable *K*_{iA}is the sum of *m*_{A}

NB-distributed random variables. We approximate its distribution by a NB distribution whose parameters we obtain from those of the *K*_{ij}. To this end, we first compute the pooled mean estimate from the counts of both conditions,

{\widehat{q}}_{i0}={\displaystyle \sum _{j:\rho (j)\in \{A,B\}}{k}_{ij}}/{s}_{j},

(12)

which accounts for the fact that the null hypothesis stipulates that *q*_{iA}= *q*_{iB}. The summed mean and variance for condition A are

{\widehat{\mu}}_{i\text{A}}={\displaystyle \sum _{j\in \text{A}}{s}_{j}}{\widehat{q}}_{i0},

(13)

{\widehat{\sigma}}_{i\text{A}}^{2}={\displaystyle \sum _{j\in \text{A}}{\widehat{s}}_{j}}{\widehat{q}}_{i0}+{\widehat{s}}_{j}^{2}{\widehat{v}}_{\text{A}}({\widehat{q}}_{i0}).

(14)

Supplementary Note C in Additional file 1 describes how the distribution parameters of the NB for *K*_{iA}can be determined from {\widehat{\mu}}_{i\text{A}} and {\widehat{\sigma}}_{i\text{A}}^{2}. (To avoid bias, we do not match the moments directly, but instead match a different pair of distribution statistics.) The parameters of *K*_{iB}are obtained analogously.

Supplementary Note D in Additional file 1 explains how we evaluate the sums in Equation (11).

### Applications

#### Data sets

We present results based on the following data sets:

##### RNA-Seq in fly embryos

B. Wilczynski, Y.-H. Liu, N. Delhomme and E. Furlong have conducted RNA-Seq experiments in fly embryos and kindly shared part of their data with us ahead of publication. In each sample of this data set, a gene was engineered to be over-expressed, and we compare two biological replicates each of two such conditions, in the following denoted as 'A' and 'B'.

##### Tag-Seq of neural stem cells

Engström *et al.* [18] performed Tag-Seq [19] for tissue cultures of neural cells, including four from glioblastoma-derived neural stem-cells ('GNS') and two from non-cancerous neural stem ('NS') cells. As each tissue culture was derived from a different subject and so has a different genotype, these data show high variability.

##### RNA-Seq of yeast

Nagalakshmi *et al.* [1] performed RNA-Seq on replicates of *Saccharomyces cerevisiae* cultures. They tested two library preparation protocols, *dT* and *RH*, and obtained three sequencing runs for each protocol, such that for the first run of each protocol, they had one further technical replicate (same culture, replicated library preparation) and one further biological replicate (different culture).

##### ChIP-Seq of HapMap samples

Kasowski *et al.* [20] compared protein occupation of DNA regions between ten human individuals by ChIP-Seq. They compiled a list of regions for polymerase II and NF-κB, and counted, for each sample, the number of reads that mapped onto each region. The aim of the study was to investigate how much the regions' occupation differed between individuals.

#### Variance estimation

We start by demonstrating the variance estimation. Figure 1a shows the sample variances *w*_{iρ}(Equation (7)) plotted against the means {\widehat{q}}_{i\rho} (Equation (6)) for condition *A* in the fly RNA-Seq data. Also shown is the local regression fit *w*_{ρ} (*q*) and the shot noise {\widehat{s}}_{j}{\widehat{q}}_{i\rho}. In Figure 1b, we plotted the squared coefficient of variation (SCV), that is the ratio of the variance to the mean squared. In this plot, the distance between the orange and the purple line is the SCV of the noise due to biological sampling (cf. Equation (3)).

The many data points in Figure 1b that lie far above the fitted orange curve may let the fit of the local regression appear poor. However, a strong skew of the residual distribution is to be expected. See Supplementary Note E in Additional file 1 for details and a discussion of diagnostics suitable to verify the fit.

#### Testing

In order to verify that *DESeq* maintains control of type-I error, we contrasted one of the replicates for condition *A* in the fly data against the other one, using for both samples the variance function estimated from the two replicates. Figure 2 shows the empirical cumulative distribution functions (ECDFs) of the *P* values obtained from this comparison. To control type-I error, the proportion of *P* values below a threshold *α* has to be ≤ *α*, that is, the ECDF curve (blue line) should not get above the diagonal (gray line). As the figure indicates, type-I error is controlled by *edgeR* and *DESeq*, but not by a Poisson-based *χ*^{2} test. The latter underestimates the variability of the data and would thus make many false positive rejections. In addition to this evaluation on real data, we also verified *DESeq*'s type-I error control on simulated data that were generated from the error model described above; see Supplementary Note G in Additional file 1. Next, we contrasted the two *A* samples against the two *B* samples. Using the procedure described in the previous Section, we computed a *P* value for each gene. Figure 3 shows the obtained fold changes and *P* values. 12% of the P values were below 5%. Adjustment for multiple-testing with the procedure of Benjamini and Hochberg [21] yielded significant differential expression at false discovery rate (FDR) of 10% for 864 genes (of 17,605). These are marked in red in the figure. Figure 3 demonstrates how the ability to detect differential expression depends on overall counts. Specifically, the strong shot noise for low counts causes the testing procedure to call only very high fold changes significant. It can also be seen that, for counts below approximately 100, even a small increase in count levels reduces the impact of shot noise and hence the fold-change requirement, while at higher counts, when shot noise becomes unimportant (cf. Figure 1b), the fold-change cut-off depends only weakly on count level. These plots are helpful to guide experiment design: For weakly expressed genes, in the region where shot noise is important, power can be increased by deeper sequencing, while for the higher-count regime, increased power can only be achieved with further biological replicates.

#### Comparison with edgeR

We also analyzed the data with *edgeR* (version 1.6.0; [8, 10, 11]). We ran *edgeR* with four different settings, namely in common-dispersion and in tagwise-dispersion mode, and either using the size factors as estimated by *DESeq* or taking the total numbers of sequenced reads. The results did not depend much on these choices, and here we report the results for tag-wise dispersion mode with *DESeq*-estimated size factors. (The R code required to reproduce all analyses, figures and numbers reported in this article is provided in Additional file 2; in addition, this supplement provides the results for the other settings of *edgeR*. The raw data can be found in Additional file 3.)

Going back to Figure 1 we see that *edgeR*'s single-value dispersion estimate of the variance is lower than that of *DESeq* for weakly expressed genes and higher for strongly expressed genes. As a consequence, as we have seen in Figure 2*edgeR* is anti-conservative for lowly expressed genes. However, it compensates for this by being more conservative with strongly expressed genes, so that, on average, type-I error control is maintained.

Nevertheless, in a test between different conditions, this behavior can result in a bias in the list of discoveries; for the present data, as Figure 4 shows, weakly expressed genes seem to be overrepresented, while very few genes with high average level are called differentially expressed by *edgeR*. While overall the sensitivity of both methods seemed comparable (*DESeq* reported 864 hits, *edgeR* 1, 127 hits), *DESeq* produced results which were more balanced over the dynamic range.

Similar results were obtained with the neural stem cell data, a data set with a different biological background and different noise characteristics (see Supplementary Note F in Additional file 1). The flexibility of the variance estimation scheme presented in this work appears to offer real advantages over the existing methods across a range of applications.

#### Working without replicates

*DESeq* allows analysis of experiments with no biological replicates in one or even both of the conditions. While one may not want to draw strong conclusions from such an analysis, it may still be useful for exploration and hypothesis generation.

If replicates are available only for one of the conditions, one might choose to assume that the variance-mean dependence estimated from the data for that condition holds as well for the unreplicated one.

If neither condition has replicates, one can still perform an analysis based on the assumption that for most genes, there is no true differential expression, and that a valid mean-variance relationship can be estimated from treating the two samples as if they were replicates. A minority of differentially abundant genes will act as outliers; however, they will not have a severe impact on the gamma-family GLM fit, as the gamma distribution for low values of the shape parameter has a heavy right-hand tail. Some overestimation of the variance may be expected, which will make that approach conservative.

We performed such an analysis with the fly RNA-Seq and the neural cell Tag-Seq data, by restricting both data sets to only two samples, one from each condition. For the neural cell data, the estimated variance function was, as expected, somewhat above the two functions estimated from the *GNS* and *NS* replicates.

Using it to test for differential expression still found 269 hits at FDR = 10%, of which 202 were among the 612 hits from the more reliable analysis with all available samples. In the case of the fly RNA-Seq data, however, only 90 of the 862 hits (11%) were recovered (with two new hits). These observations are explained by the fact that in the neural cell data, the variability between replicates was not much smaller than between conditions, making the latter a usable surrogate for the former. On the other hand, for the fly data, the variability between replicates was much smaller than between the conditions, indicating that the replication provided important and otherwise not available information on the experimental variation in the data (see also next Section).

#### Variance-stabilizing transformation

Given a variance-mean dependence, a variance-stabilizing transformation (VST) is a monotonous mapping such that for the transformed values, the variance is (approximately) independent of the mean. Using the variance-mean dependence *w*(*q*) estimated by *DESeq*, a VST is given by

\tau (\kappa )={{\displaystyle \int}}^{\kappa}\frac{\text{d}q}{\sqrt{w(q)}}.

(15)

Applying the transformation *τ* to the common-scale count data, *k*_{ij} /*s*_{j} , yields values whose variances are approximately the same throughout the dynamic range. One application of VST is sample clustering, as in Figure 5; such an approach is more straightforward than, say, defining a suitable distance metric on the untransformed count data, whose choice is not obvious, and may not be easy to combine with available clustering or classification algorithms (which tend to be designed for variables with similar distributional properties).

#### ChIP-Seq

*DESeq* can also be used to analyze comparative ChIP-Seq assays. Kasowski *et al.* [20] analyzed transcription factor binding for HapMap individuals and counted for each sample how many reads mapped to pre-determined binding regions. We considered two individuals from their data set, HapMap IDs GM12878 and GM12891, for both of which at least four replicates had been done, and tested for differential occupation of the regions. The upper left two panels of Figure 6 which show comparisons within the same individual, indicate that type-I error was controlled by *DESeq*. No region was significant at 10% FDR using Benjamini-Hochberg adjustment. Differential occupation was found, however, when contrasting the two individuals, with 4,460 of 19,028 regions significant when only two replicates each were used and 8,442 when four replicates were used (upper right two panels).

Using an alternative approach, Kasowski *et al*. fitted generalized linear models (GLMs) of the Poisson family. This (lower row of Figure 6) resulted in an enrichment of small *P* values even for comparisons within the same individual, indicating that the variance was underestimated by the Poisson GLM, and literal use of the P values would lead to anti-conservative (overly optimistic) bias. Kasowski *et al*. addressed this and adjusted for the bias by using additional criteria for calling differential occupation.

## Comments

View archived comments (2)