A summary of the notation used in the following section is provided in Additional file 1: Table S1.
Model and normalization
The read count K
ij
for gene i in sample j is described with a GLM of the negative binomial family with a logarithmic link:
(1)
(2)
For notational simplicity, the equations here use the natural logarithm as the link function, though the DESeq2 software reports estimated model coefficients and their estimated standard errors on the log2 scale.
By default, the normalization constants s
ij
are considered constant within a sample, s
ij
=s
j
, and are estimated with the median-of-ratios method previously described and used in DESeq [4] and DEXSeq [30]:
Alternatively, the user can supply normalization constants s
ij
calculated using other methods (e.g., using cqn [13] or EDASeq [14]), which may differ from gene to gene.
Expanded design matrices
For consistency with our software’s documentation, in the following text we will use the terminology of the R statistical language. In linear modeling, a categorical variable or factor can take on two or more values or levels. In standard design matrices, one of the values is chosen as a reference value or base level and absorbed into the intercept. In standard GLMs, the choice of base level does not influence the values of contrasts (LFCs). This, however, is no longer the case in our approach using ridge-regression-like shrinkage on the coefficients (described below), when factors with more than two levels are present in the design matrix, because the base level will not undergo shrinkage while the other levels do.
To recover the desirable symmetry between all levels, DESeq2 uses expanded design matrices, which include an indicator variable for each level of each factor, in addition to an intercept column (i.e., none of the levels is absorbed into the intercept). While such a design matrix no longer has full rank, a unique solution exists because the zero-centered prior distribution (see below) provides regularization. For dispersion estimation and for estimating the width of the LFC prior, standard design matrices are used.
Contrasts
Contrasts between levels and standard errors of such contrasts can be calculated as they would in the standard design matrix case, i.e., using:
(4)
where represents a numeric contrast, e.g., 1 and −1 specifying the numerator and denominator of a simple two-level contrast, and , defined below.
Estimation of dispersions
We assume the dispersion parameter α
i
follows a log-normal prior distribution that is centered around a trend that depends on the gene’s mean normalized read count:
(5)
Here, α tr is a function of the gene’s mean normalized count,
It describes the mean-dependent expectation of the prior. σ d is the width of the prior, a hyperparameter describing how much the individual genes’ true dispersions scatter around the trend. For the trend function, we use the same parametrization as we used for DEXSeq [30], namely,
(6)
We get final dispersion estimates from this model in three steps, which implement a computationally fast approximation to a full empirical Bayes treatment. We first use the count data for each gene separately to get preliminary gene-wise dispersion estimates by maximum-likelihood estimation. Then, we fit the dispersion trend α tr. Finally, we combine the likelihood with the trended prior to get maximum a posteriori (MAP) values as final dispersion estimates. Details for the three steps follow.
Gene-wise dispersion estimates To get a gene-wise dispersion estimate for a gene i, we start by fitting a negative binomial GLM without an LFC prior for the design matrix X to the gene’s count data. This GLM uses a rough method-of-moments estimate of dispersion, based on the within-group variances and means. The initial GLM is necessary to obtain an initial set of fitted values, . We then maximize the Cox–Reid adjusted likelihood of the dispersion, conditioned on the fitted values from the initial fit, to obtain the gene-wise estimate , i.e.,
with
(7)
where f NB(k;μ,α) is the probability mass function of the negative binomial distribution with mean μ and dispersion α, and the second term provides the Cox–Reid bias adjustment [47]. This adjustment, first used in the context of dispersion estimation for SAGE data [48] and then for HTS data [3] in edgeR, corrects for the negative bias of dispersion estimates from using the MLEs for the fitted values (analogous to Bessel’s correction in the usual sample variance formula; for details, see [49], Section 10.6). It is formed from the Fisher information for the fitted values, which is here calculated as det(X tW X), where W is the diagonal weight matrix from the standard iteratively reweighted least-squares algorithm. As the GLM’s link function is g(μ)= log(μ) and its variance function is V(μ;α)=μ+α μ 2, the elements of the diagonal matrix W
i
are given by:
The optimization in Equation (7) is performed on the scale of logα using a backtracking line search with accepted proposals that satisfy Armijo conditions [50].
Dispersion trend A parametric curve of the form (6) is fit by regressing the gene-wise dispersion estimates onto the means of the normalized counts, . The sampling distribution of the gene-wise dispersion estimate around the true value α
i
can be highly skewed, and therefore we do not use ordinary least-squares regression but rather gamma-family GLM regression. Furthermore, dispersion outliers could skew the fit and hence a scheme to exclude such outliers is used.
The hyperparameters a 1 and α 0 of (6) are obtained by iteratively fitting a gamma-family GLM. At each iteration, genes with a ratio of dispersion to fitted value outside the range [10−4,15] are left out until the sum of squared LFCs of the new coefficients over the old coefficients is less than 10−6 (same approach as in DEXSeq [30]).
The parametrization (6) is based on reports by us and others of decreasing dependence of dispersion on the mean in many datasets [3]-[6],[51]. Some caution is warranted to disentangle true underlying dependence from effects of estimation bias that can create a perceived dependence of the dispersion on the mean. Consider a negative binomial distributed random variable with expectation μ and dispersion α. Its variance v=μ+α μ 2 has two components, v=v P+v D, the Poisson component v P=μ independent of α, and the overdispersion component v D=α μ 2. When μ is small, μ≲1/α (vertical lines in Additional file 1: Figure S1), the Poisson component dominates, in the sense that , and the observed data provide little information on the value of α. Therefore the sampling variance of an estimator for α will be large when μ≲1/α, which leads to the appearance of bias. For simplicity, we have stated the above argument without regard to the influence of the size factors, s
j
, on the value of μ. This is permissible because, by construction, the geometric mean of our size factors is close to 1, and hence, the mean across samples of the unnormalized read counts, , and the mean of the normalized read counts, , will be roughly the same.
This phenomenon may give rise to an apparent dependence of α on μ. It is possible that the shape of the dispersion-mean fit for the Bottomly data (Figure 1A) can be explained in that manner: the asymptotic dispersion is α 0≈0.01, and the non-zero slope of the mean-dispersion plot is limited to the range of mean counts up to around 100, the reciprocal of α 0. However, overestimation of α in that low-count range has little effect on inference, as in that range the variance v is anyway dominated by the α-independent Poisson component v P. The situation is different for the Pickrell data: here, a dependence of dispersion on mean was observed for counts clearly above the reciprocal of the asymptotic dispersion α 0 (Figure 1B), and hence is not due merely to estimation bias. Simulations (shown in Additional file 1: Figure S25) confirmed that the observed joint distribution of estimated dispersions and means is not compatible with a single, constant dispersion. Therefore, the parametrization (6) is a flexible and mildly conservative modeling choice: it is able to pick up dispersion-mean dependence if it is present, while it can lead to a minor loss of power in the low-count range due to a tendency to overestimate dispersion there.
Dispersion prior As also observed by Wu et al. [6], a log-normal prior fits the observed dispersion distribution for typical RNA-seq datasets. We solve the computational difficulty of working with a non-conjugate prior using the following argument: the logarithmic residuals from the trend fit, , arise from two contributions, namely the scatter of the true logarithmic dispersions around the trend, given by the prior with variance , and the sampling distribution of the logarithm of the dispersion estimator, with variance . The sampling distribution of a dispersion estimator is approximately a scaled χ 2 distribution with m−p degrees of freedom, with m the number of samples and p the number of coefficients. The variance of the logarithm of a -distributed random variable is given [52] by the trigamma function ψ 1,
Therefore, , i.e., the sampling variance of the logarithm of a variance or dispersion estimator is approximately constant across genes and depends only on the degrees of freedom of the model.
Additional file 1: Table S2 compares this approximation for the variance of logarithmic dispersion estimates with the variance of logarithmic Cox–Reid adjusted dispersion estimates for simulated negative binomial data, over a combination of different sample sizes, number of parameters and dispersion values used to create the simulated data. The approximation is close to the sample variance for various typical values of m, p and α.
Therefore, the prior variance is obtained by subtracting the expected sampling variance from an estimate of the variance of the logarithmic residuals, :
The prior variance is thresholded at a minimal value of 0.25 so that the dispersion estimates are not shrunk entirely to if the variance of the logarithmic residuals is less than the expected sampling variance.
To avoid inflation of due to dispersion outliers (i.e., genes not well captured by this prior; see below), we use a robust estimator for the standard deviation s lr of the logarithmic residuals,
(8)
where mad stands for the median absolute deviation, divided as usual by the scaling factor Φ −1(3/4).
Three or less residuals degrees of freedom When there are three or less residual degrees of freedom (number of samples minus number of parameters to estimate), the estimation of the prior variance using the observed variance of logarithmic residuals tends to underestimate . In this case, we instead estimate the prior variance through simulation. We match the distribution of logarithmic residuals to a density of simulated logarithmic residuals. These are the logarithm of -distributed random variables added to random variables to account for the spread due to the prior. The simulated distribution is shifted by − log(m−p) to account for the scaling of the χ 2 distribution. We repeat the simulation over a grid of values for , and select the value that minimizes the Kullback–Leibler divergence from the observed density of logarithmic residuals to the simulated density.
Final dispersion estimate We form a logarithmic posterior for the dispersion from the Cox–Reid adjusted logarithmic likelihood (7) and the logarithmic prior (5) and use its maximum (i.e., the MAP value) as the final estimate of the dispersion,
(9)
where
is, up to an additive constant, the logarithm of the density of prior (5). Again, a backtracking line search is used to perform the optimization.
Dispersion outliers For some genes, the gene-wise estimate can be so far above the prior expectation that it would be unreasonable to assume that the prior is suitable for the gene. If the dispersion estimate for such genes were down-moderated toward the fitted trend, this might lead to false positives. Therefore, we use the heuristic of considering a gene as a dispersion outlier, if the residual from the trend fit is more than two standard deviations of logarithmic residuals, s lr (see Equation (8)), above the fit, i.e., if
For such genes, the gene-wise estimate is not shrunk toward the trended prior mean. Instead of the MAP value , we use the gene-wise estimate as a final dispersion value in the subsequent steps. In addition, the iterative fitting procedure for the parametric dispersion trend described above avoids that such dispersion outliers influence the prior mean.
Shrinkage estimation of logarithmic fold changes
To incorporate empirical Bayes shrinkage of LFCs, we postulate a zero-centered normal prior for the coefficients β
ir
of model (2) that represent LFCs (i.e., typically, all coefficients except for the intercept β i0):
As was observed with differential expression analysis using microarrays, genes with low intensity values tend to suffer from a small signal-to-noise ratio. Alternative estimators can be found that are more stable than the standard calculation of fold change as the ratio of average observed values for each condition [53]-[55]. DESeq2’s approach can be seen as an extension of these approaches for stable estimation of gene-expression fold changes to count data.
Empirical prior estimate To obtain values for the empirical prior widths σ
r
for the model coefficients, we again approximate a full empirical Bayes approach, as with the estimation of dispersion prior, though here we do not subtract the expected sampling variance from the observed variance of maximum likelihood estimates. The estimate of the LFC prior width is calculated as follows. We use the standard iteratively reweighted least-squares algorithm [12] for each gene’s model, Equations (1) and (2), to get MLEs for the coefficients . We then fit, for each column r of the design matrix (except for the intercept), a zero-centered normal distribution to the empirical distribution of MLE fold change estimates .
To make the fit robust against outliers with very high absolute LFC values, we use quantile matching: the width σ
r
is chosen such that the (1−p) empirical quantile of the absolute value of the observed LFCs, , matches the (1−p/2) theoretical quantile of the prior, , where p is set by default to 0.05. If we write the theoretical upper quantile of a normal distribution as Q
N
(1−p) and the empirical upper quantile of the MLE LFCs as , then the prior width is calculated as:
To ensure that the prior width σ
r
will be independent of the choice of base level, the estimates from the quantile matching procedure are averaged for each factor over all possible contrasts of factor levels. When determining the empirical upper quantile, extreme LFC values (, or 10 on the base 2 scale) are excluded.
Final estimate of logarithmic fold changes The logarithmic posterior for the vector, , of model coefficients β
ir
for gene i is the sum of the logarithmic likelihood of the GLM (2) and the logarithm of the prior density (10), and its maximum yields the final MAP coefficient estimates:
where
and α
i
is the final dispersion estimate for gene i, i.e., , except for dispersion outliers, where .
The term Λ(β), i.e., the logarithm of the density of the normal prior (up to an additive constant), can be read as a ridge penalty term, and therefore, we perform the optimization using the iteratively reweighted ridge regression algorithm [56], also known as weighted updates [57]. Specifically, the updates for a given gene are of the form
with and
where the current fitted values are computed from the current estimates in each iteration.
Fisher information. The effect of the zero-centered normal prior can be understood as shrinking the MAP LFC estimates based on the amount of information the experiment provides for this coefficient, and we briefly elaborate on this here. Specifically, for a given gene i, the shrinkage for an LFC β
ir
depends on the observed Fisher information, given by
where is the logarithm of the likelihood, and partial derivatives are taken with respect to LFC β
ir
. For a negative binomial GLM, the observed Fisher information, or peakedness of the logarithm of the profile likelihood, is influenced by a number of factors including the degrees of freedom, the estimated mean counts μ
ij
, and the gene’s dispersion estimate α
i
. The prior influences the MAP estimate when the density of the likelihood and the prior are multiplied to calculate the posterior. Genes with low estimated mean values μ
ij
or high dispersion estimates α
i
have flatter profile likelihoods, as do datasets with few residual degrees of freedom, and therefore in these cases the zero-centered prior pulls the MAP estimate from a high-uncertainty MLE closer toward zero.
Wald test
The Wald test compares the beta estimate β
ir
divided by its estimated standard error SE(β
ir
) to a standard normal distribution. The estimated standard errors are the square root of the diagonal elements of the estimated covariance matrix, Σ
i
, for the coefficients, i.e., . Contrasts of coefficients are tested similarly by forming a Wald statistics using (3) and (4). We use the following formula for the coefficient covariance matrix for a GLM with normal prior on coefficients [56], [58]:
The tail integrals of the standard normal distribution are multiplied by 2 to achieve a two-tailed test. The Wald test P values from the subset of genes that pass the independent filtering step are adjusted for multiple testing using the procedure of Benjamini and Hochberg [21].
Independent filtering
Independent filtering does not compromise type-I error control as long as the distribution of the test statistic is marginally independent of the filter statistic under the null hypothesis [22], and we argue in the following that this is the case in our application. The filter statistic in DESeq2 is the mean of normalized counts for a gene, while the test statistic is p, the P value from the Wald test. We first consider the case where the size factors are equal and where the gene-wise dispersion estimates are used for each gene, i.e. without dispersion shrinkage. The distribution family for the negative binomial is parameterized by θ=(μ,α). Aside from discreteness of p due to low counts, for a given μ, the distribution of p is Uniform(0,1) under the null hypothesis, so p is an ancillary statistic. The sample mean of counts for gene i, , is boundedly complete sufficient for μ. Then from Basu’s theorem, and p are independent.
While for very low counts, one can observe discreteness and non-uniformity of p under the null hypothesis, DESeq2 does not use the distribution of p in its estimation procedure – for example, DESeq2 does not estimate the proportion of null genes using the distribution of p – so this kind of dependence of p on μ does not lead to increased type-I error.
If the size factors are not equal across samples, but not correlated with condition, conditioning on the mean of normalized counts should also provide uniformly distributed p as with conditioning on the mean of counts, . We may consider a pathological case where the size factors are perfectly confounded with condition, in which case, even under the null hypothesis, genes with low mean count would have non-uniform distribution of p, as one condition could have positive counts and the other condition often zero counts. This could lead to non-uniformity of p under the null hypothesis; however, such a pathological case would pose problems for many statistical tests of differences in mean.
We used simulation to demonstrate that the independence of the null distribution of the test statistic from the filter statistic still holds for dispersion shrinkage. Additional file 1: Figure S26 displays marginal null distributions of p across the range of mean normalized counts. Despite spikes in the distribution for the genes with the lowest mean counts due to discreteness of the data, these densities were nearly uniform across the range of average expression strength.
Composite null hypotheses
DESeq2 offers tests for composite null hypotheses of the form to find genes whose LFC significantly exceeds a threshold θ>0. The composite null hypothesis is replaced by two simple null hypotheses: and . Two-tailed P values are generated by integrating a normal distribution centered on θ with standard deviation SE(β
ir
) from |β
ir
| toward ∞. The value of the integral is then multiplied by 2 and thresholded at 1. This procedure controls type-I error even when β
ir
=±θ, and is equivalent to the standard DESeq2 P value when θ=0.
Conversely, when searching for genes whose absolute LFC is significantly below a threshold, i.e., when testing the null hypothesis , the P value is constructed as the maximum of two one-sided tests of the simple null hypotheses: and . The one-sided P values are generated by integrating a normal distribution centered on θ with standard deviation SE(β
ir
) from β
ir
toward −∞, and integrating a normal distribution centered on −θ with standard deviation SE(β
ir
) from β
ir
toward ∞.
Note that while a zero-centered prior on LFCs is consistent with testing the null hypothesis of small LFCs, it should not be used when testing the null hypothesis of large LFCs, because the prior would then favor the alternative hypothesis. DESeq2 requires that no prior has been used when testing the null hypothesis of large LFCs, so that the data alone must provide evidence against the null hypothesis.
Interactions
Two exceptions to the default DESeq2 LFC estimation steps are used for experimental designs with interaction terms. First, when any interaction terms are included in the design, the LFC prior width for main effect terms is not estimated from the data, but set to a wide value (, or 1000 on the base 2 scale). This ensures that shrinkage of main effect terms will not result in false positive calls of significance for interactions. Second, when interaction terms are included and all factors have two levels, then standard design matrices are used rather than expanded model matrices, such that only a single term is used to test the null hypothesis that a combination of two effects is merely additive in the logarithmic scale.
Regularized logarithm
The rlog transformation is calculated as follows. The experimental design matrix X is substituted with a design matrix with an indicator variable for every sample in addition to an intercept column. A model as described in Equations (1) and (2) is fit with a zero-centered normal prior on the non-intercept terms and using the fitted dispersion values , which capture the overall variance-mean dependence of the dataset. The true experimental design matrix X is then only used in estimating the variance-mean trend over all genes. For unsupervised analyses, for instance sample quality assessment, it is desirable that the experimental design has no influence on the transformation, and hence DESeq2 by default ignores the design matrix and re-estimates the dispersions treating all samples as replicates, i.e., it uses blind dispersion estimation. The rlog-transformed values are the fitted values,
where β
ij
is the shrunken LFC on the base 2 scale for the jth sample. The variance of the prior is set using a similar approach as taken with differential expression, by matching a zero-centered normal distribution to observed LFCs. First a matrix of LFCs is calculated by taking the logarithm (base 2) of the normalized counts plus a pseudocount of for each sample divided by the mean of normalized counts plus a pseudocount of . The pseudocount of allows for calculation of the logarithmic ratio for all genes, and has little effect on the estimate of the variance of the prior or the final rlog transformation. This matrix of LFCs then represents the common-scale logarithmic ratio of each sample to the fitted value using only an intercept. The prior variance is found by matching the 97.5% quantile of a zero-centered normal distribution to the 95% quantile of the absolute values in the LFC matrix.
Cook’s distance for outlier detection
The MLE of is used for calculating Cook’s distance. Considering a gene i and sample j, Cook’s distance for GLMs is given by [59]:
where R
ij
is the Pearson residual of sample j, τ is an overdispersion parameter (in the negative binomial GLM, τ is set to 1), p is the number of parameters including the intercept, and h
jj
is the jth diagonal element of the hat matrix H:
Pearson residuals R
ij
are calculated as
where μ
ij
is estimated by the negative binomial GLM without the LFC prior, and using the variance function V(μ)=μ+α μ 2. A method-of-moments estimate , using a robust estimator of variance to provide robustness against outliers, is used here:
R/Bioconductor package
DESeq2 is implemented as a package for the R statistical environment and is available [10] as part of the Bioconductor project [11]. The count matrix and metadata, including the gene model and sample information, are stored in an S4 class derived from the SummarizedExperiment class of the GenomicRanges package [60]. SummarizedExperiment objects containing count matrices can be easily generated using the summarizeOverlaps function of the GenomicAlignments package [61]. This workflow automatically stores the gene model as metadata and additionally other information such as the genome and gene annotation versions. Other methods to obtain count matrices include the htseq-count script [62] and the Bioconductor packages easyRNASeq [63] and featureCount [64].
The DESeq2 package comes with a detailed vignette, which works through a number of example differential expression analyses on real datasets, and the use of the rlog transformation for quality assessment and visualization. A single function, called DESeq, is used to run the default analysis, while lower-level functions are also available for advanced users.
Read alignment for the Bottomly et al. and Pickrell et al.datasets
Reads were aligned using the TopHat2 aligner [65], and assigned to genes using the summarizeOverlaps function of the GenomicRanges package [60]. The sequence read archive fastq files of the Pickrell et al. [17] dataset (accession number [SRA:SRP001540]) were aligned to the Homo sapiens reference sequence GRCh37 downloaded in March 2013 from Illumina iGenomes. Reads were counted in the genes defined by the Ensembl GTF file, release 70, contained in the Illumina iGenome. The sequence read archive fastq files of the Bottomly et al. [16] dataset (accession number [SRA:SRP004777]) were aligned to the Mus musculus reference sequence NCBIM37 downloaded in March 2013 from Illumina iGenomes. Reads were counted in the genes defined by the Ensembl GTF file, release 66, contained in the Illumina iGenome.
Reproducible code
Sweave vignettes for reproducing all figures and tables in this paper, including data objects for the experiments mentioned, and code for aligning reads and for benchmarking, can be found in a package DESeq2paper [66].