Differential expression analysis for sequence count data
 Simon Anders^{1}Email author and
 Wolfgang Huber^{1}
DOI: 10.1186/gb20101110r106
© Anders et al 2010
Received: 20 April 2010
Accepted: 27 October 2010
Published: 27 October 2010
Abstract
Highthroughput sequencing assays such as RNASeq, ChIPSeq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression and present an implementation, DESeq, as an R/Bioconductor package.
Background
Highthroughput sequencing of DNA fragments is used in a range of quantitative assays. A common feature between these assays is that they sequence large amounts of DNA fragments that reflect, for example, a biological system's repertoire of RNA molecules (RNASeq [1, 2]) or the DNA or RNA interaction regions of nucleotide binding molecules (ChIPSeq [3], HITSCLIP [4]). Typically, these reads are assigned to a class based on their mapping to a common region of the target genome, where each class represents a target transcript, in the case of RNASeq, or a binding region, in the case of ChIPSeq. An important summary statistic is the number of reads in a class; for RNASeq, this read count has been found to be (to good approximation) linearly related to the abundance of the target transcript [2]. Interest lies in comparing read counts between different biological conditions. In the simplest case, the comparison is done separately, class by class. We will use the term gene synonymously to class, even though a class may also refer to, for example, a transcription factor binding site, or even a barcode [5].
We would like to use statistical testing to decide whether, for a given gene, an observed difference in read counts is significant, that is, whether it is greater than what would be expected just due to natural random variation.
If reads were independently sampled from a population with given, fixed fractions of genes, the read counts would follow a multinomial distribution, which can be approximated by the Poisson distribution.
Consequently, the Poisson distribution has been used to test for differential expression [6, 7]. The Poisson distribution has a single parameter, which is uniquely determined by its mean; its variance and all other properties follow from it; in particular, the variance is equal to the mean. However, it has been noted [1, 8] that the assumption of Poisson distribution is too restrictive: it predicts smaller variations than what is seen in the data. Therefore, the resulting statistical test does not control typeI error (the probability of false discoveries) as advertised. We show instances for this later, in the Discussion.
To address this socalled overdispersion problem, it has been proposed to model count data with negative binomial (NB) distributions [9], and this approach is used in the edgeR package for analysis of SAGE and RNASeq [8, 10]. The NB distribution has parameters, which are uniquely determined by mean μ and variance σ^{2}. However, the number of replicates in data sets of interest is often too small to estimate both parameters, mean and variance, reliably for each gene. For edgeR, Robinson and Smyth assumed [11] that mean and variance are related by σ^{2} = μ + αμ^{2}, with a single proportionality constant α that is the same throughout the experiment and that can be estimated from the data. Hence, only one parameter needs to be estimated for each gene, allowing application to experiments with small numbers of replicates.
In this paper, we extend this model by allowing more general, datadriven relationships of variance and mean, provide an effective algorithm for fitting the model to data, and show that it provides better fits (Section Model). As a result, more balanced selection of differentially expressed genes throughout the dynamic range of the data can be obtained (Section Testing for differential expression). We demonstrate the method by applying it to four data sets (Section Applications) and discuss how it compares to alternative approaches (Section Conclusions).
Results and Discussion
Model
Description
which has two parameters, the mean μ_{ ij } and the variance ${\sigma}_{ij}^{2}$. The read counts K_{ ij } are nonnegative integers. The probabilities of the distribution are given in Supplementary Note A. (All Supplementary Notes are in Additional file 1.) The NB distribution is commonly used to model count data when overdispersion is present [12].
In practice, we do not know the parameters μ_{ ij } and ${\sigma}_{ij}^{2}$, and we need to estimate them from the data. Typically, the number of replicates is small, and further modelling assumptions need to be made in order to obtain useful estimates. In this paper, we develop a method that is based on the following three assumptions.
q_{i,ρ(j) }is proportional to the expectation value of the true (but unknown) concentration of fragments from gene i under condition ρ(j). The size factor s_{ j }represents the coverage, or sampling depth, of library j, and we will use the term common scale for quantities, such as q_{i, ρ(j)}, that are adjusted for coverage by dividing by s_{ j }.
This assumption is needed because the number of replicates is typically too low to get a precise estimate of the variance for gene i from just the data available for this gene. This assumption allows us to pool the data from genes with similar expression strength for the purpose of variance estimation.
The decomposition of the variance in Equation (3) is motivated by the following hierarchical model: We assume that the actual concentration of fragments from gene i in sample j is proportional to a random variable R_{ ij } , such that the rate that fragments from gene i are sequenced is s_{ j }r_{ ij } . For each gene i and all samples j of condition ρ, the R_{ ij } are i.i.d. with mean q_{ iρ } and variance v_{ iρ } . Thus, the count value K_{ ij } , conditioned on R_{ ij } = r_{ ij } , is Poisson distributed with rate s_{ j }r_{ ij } . The marginal distribution of K_{ ij }  when allowing for variation in R_{ ij }  has the mean μ_{ ij } and (according to the law of total variance) the variance given in Equation (3). Furthermore, if the higher moments of R_{ ij } are modeled according to a gamma distribution, the marginal distribution of K_{ ij } is NB (see, for example, [12], Section 4.2.2).
Fitting
We now describe how the model can be fitted to data. The data are an n × m table of counts, k_{ ij } , where i = 1,..., n indexes the genes, and j = 1,..., m indexes the samples. The model has three sets of parameters:
(i) m size factors s_{ j } ; the expectation values of all counts from sample j are proportional to s_{ j } .
(ii) for each experimental condition ρ, n expression strength parameters q_{ iρ } ; they reflect the expected abundance of fragments from gene i under condition ρ, that is, expectation values of counts for gene i are proportional to q_{ iρ } .
(iii) The smooth functions v_{ ρ } : ℝ^{+} → ℝ^{+}; for each condition ρ, v_{ ρ } models the dependence of the raw variance v_{ iρ } on the expected mean q_{ iρ } .
The purpose of the size factors s_{ j } is to render counts from different samples, which may have been sequenced to different depths, comparable. Hence, the ratios ($\mathbb{E}$K_{ ij } )/($\mathbb{E}$K_{ ij' } ) of expected counts for the same gene i in different samples j and j' should be equal to the size ratio s_{ j } /s_{ j' } if gene i is not differentially expressed or samples j and j' are replicates. The total number of reads, Σ _{ i } k_{ ij } , may seem to be a good measure of sequencing depth and hence a reasonable choice for s_{ j } . Experience with real data, however, shows this not always to be the case, because a few highly and differentially expressed genes may have strong influence on the total read count, causing the ratio of total read counts not to be a good estimate for the ratio of expected counts.
The denominator of this expression can be interpreted as a pseudoreference sample obtained by taking the geometric mean across samples. Thus, each size factor estimate ${\widehat{s}}_{j}$ is computed as the median of the ratios of the jth sample's counts to those of the pseudoreference. (Note: While this manuscript was under review, Robinson and Oshlack [13] suggested a similar method.)
In Supplementary Note B in Additional file 1 we show that w_{ iρ }  z_{ iρ } is an unbiased estimator for the raw variance parameter v_{ iρ } of Equation (3).
as our estimate for the raw variance.
Some attention is needed to avoid estimation biases in the local regression. w_{ iρ } is a sum of squared random variables, and the residuals ${w}_{i\rho}w({\widehat{q}}_{i\rho})$ are skewed. Following References [15], Chapter 8 and [14], Section 9.1.2, we use a generalized linear model of the gamma family for the local regression, using the implementation in the locfit package [16].
Testing for differential expression
The variables a and b in the above sums take the values 0,..., k_{iS}. The approach presented so far follows that of Robinson and Smyth [11] and is analogous to that taken by other conditioned tests, such as Fisher's exact test. (See Reference [17], Chapter 3 for a discussion of the merits of conditioning in tests.)
Computation of p(a, b). First, assume that, under the null hypothesis, counts from different samples are independent. Then, p(a, b) = Pr(K_{iA }= a) Pr(K_{iB }= b). The problem thus is computing the probability of the event K_{iA }= a, and, analogously, of K_{iB }= b. The random variable K_{iA }is the sum of m_{ A }
Supplementary Note C in Additional file 1 describes how the distribution parameters of the NB for K_{iA }can be determined from ${\widehat{\mu}}_{i\text{A}}$ and ${\widehat{\sigma}}_{i\text{A}}^{2}$. (To avoid bias, we do not match the moments directly, but instead match a different pair of distribution statistics.) The parameters of K_{iB }are obtained analogously.
Supplementary Note D in Additional file 1 explains how we evaluate the sums in Equation (11).
Applications
Data sets
We present results based on the following data sets:
RNASeq in fly embryos
B. Wilczynski, Y.H. Liu, N. Delhomme and E. Furlong have conducted RNASeq experiments in fly embryos and kindly shared part of their data with us ahead of publication. In each sample of this data set, a gene was engineered to be overexpressed, and we compare two biological replicates each of two such conditions, in the following denoted as 'A' and 'B'.
TagSeq of neural stem cells
Engström et al. [18] performed TagSeq [19] for tissue cultures of neural cells, including four from glioblastomaderived neural stemcells ('GNS') and two from noncancerous neural stem ('NS') cells. As each tissue culture was derived from a different subject and so has a different genotype, these data show high variability.
RNASeq of yeast
Nagalakshmi et al. [1] performed RNASeq on replicates of Saccharomyces cerevisiae cultures. They tested two library preparation protocols, dT and RH, and obtained three sequencing runs for each protocol, such that for the first run of each protocol, they had one further technical replicate (same culture, replicated library preparation) and one further biological replicate (different culture).
ChIPSeq of HapMap samples
Kasowski et al. [20] compared protein occupation of DNA regions between ten human individuals by ChIPSeq. They compiled a list of regions for polymerase II and NFκB, and counted, for each sample, the number of reads that mapped onto each region. The aim of the study was to investigate how much the regions' occupation differed between individuals.
Variance estimation
The many data points in Figure 1b that lie far above the fitted orange curve may let the fit of the local regression appear poor. However, a strong skew of the residual distribution is to be expected. See Supplementary Note E in Additional file 1 for details and a discussion of diagnostics suitable to verify the fit.
Testing
Comparison with edgeR
We also analyzed the data with edgeR (version 1.6.0; [8, 10, 11]). We ran edgeR with four different settings, namely in commondispersion and in tagwisedispersion mode, and either using the size factors as estimated by DESeq or taking the total numbers of sequenced reads. The results did not depend much on these choices, and here we report the results for tagwise dispersion mode with DESeqestimated size factors. (The R code required to reproduce all analyses, figures and numbers reported in this article is provided in Additional file 2; in addition, this supplement provides the results for the other settings of edgeR. The raw data can be found in Additional file 3.)
Going back to Figure 1 we see that edgeR's singlevalue dispersion estimate of the variance is lower than that of DESeq for weakly expressed genes and higher for strongly expressed genes. As a consequence, as we have seen in Figure 2edgeR is anticonservative for lowly expressed genes. However, it compensates for this by being more conservative with strongly expressed genes, so that, on average, typeI error control is maintained.
Similar results were obtained with the neural stem cell data, a data set with a different biological background and different noise characteristics (see Supplementary Note F in Additional file 1). The flexibility of the variance estimation scheme presented in this work appears to offer real advantages over the existing methods across a range of applications.
Working without replicates
DESeq allows analysis of experiments with no biological replicates in one or even both of the conditions. While one may not want to draw strong conclusions from such an analysis, it may still be useful for exploration and hypothesis generation.
If replicates are available only for one of the conditions, one might choose to assume that the variancemean dependence estimated from the data for that condition holds as well for the unreplicated one.
If neither condition has replicates, one can still perform an analysis based on the assumption that for most genes, there is no true differential expression, and that a valid meanvariance relationship can be estimated from treating the two samples as if they were replicates. A minority of differentially abundant genes will act as outliers; however, they will not have a severe impact on the gammafamily GLM fit, as the gamma distribution for low values of the shape parameter has a heavy righthand tail. Some overestimation of the variance may be expected, which will make that approach conservative.
We performed such an analysis with the fly RNASeq and the neural cell TagSeq data, by restricting both data sets to only two samples, one from each condition. For the neural cell data, the estimated variance function was, as expected, somewhat above the two functions estimated from the GNS and NS replicates.
Using it to test for differential expression still found 269 hits at FDR = 10%, of which 202 were among the 612 hits from the more reliable analysis with all available samples. In the case of the fly RNASeq data, however, only 90 of the 862 hits (11%) were recovered (with two new hits). These observations are explained by the fact that in the neural cell data, the variability between replicates was not much smaller than between conditions, making the latter a usable surrogate for the former. On the other hand, for the fly data, the variability between replicates was much smaller than between the conditions, indicating that the replication provided important and otherwise not available information on the experimental variation in the data (see also next Section).
Variancestabilizing transformation
ChIPSeq
Using an alternative approach, Kasowski et al. fitted generalized linear models (GLMs) of the Poisson family. This (lower row of Figure 6) resulted in an enrichment of small P values even for comparisons within the same individual, indicating that the variance was underestimated by the Poisson GLM, and literal use of the P values would lead to anticonservative (overly optimistic) bias. Kasowski et al. addressed this and adjusted for the bias by using additional criteria for calling differential occupation.
Conclusions
Why is it necessary to develop new statistical methodology for sequence count data? If large numbers of replicates were available, questions of data distribution could be avoided by using nonparametric methods, such as rankbased or permutation tests. However, it is desirable (and possible) to consider experiments with smaller numbers of replicates per condition. In order to compare an observed difference with an expected random variation, we can improve our picture of the latter in two ways: first, we can use distribution families, such as normal, Poisson and negative binomial distributions, in order to determine the higher moments, and hence the tail behavior, of statistics for differential expression, based on observed low order moments such as mean and variance. Second, we can share information, for instance, distributional parameters, between genes, based on the notion that data from different genes follow similar patterns of variability. Here, we have described an instance of such an approach, and we will now discuss the choices we have made.
Choice of distribution
While for large counts, normal distributions might provide a good approximation of betweenreplicate variability, this is not the case for lower count values, whose discreteness and skewness mean that probability estimates computed from a normal approximation would be inadequate.
For the Poisson approximation, a key paper is the work by Marioni et al. [6], who studied the technical reproducibility of RNASeq. They extracted total RNA from two tissue samples, one from the liver and one from the kidneys of the same individual. From each RNA sample they took seven aliquots, prepared a library from each aliquot according to the protocol recommended by Illumina and sampled each library on one lane of a Solexa genome analyzer. For each gene, they then calculated the variance of the seven counts from the same tissue sample and found very good agreement with the variance predicted by a Poisson model. In line with our arguments in Section Model, Poisson shot noise is the minimum amount of variation to expect in a counting process. Thus, Marioni et al. concluded that the technical reproducibility of RNASeq is excellent, and that the variation between technical replicates is close to the shot noise limit. From this vantage point, Marioni et al. (and similarly Bullard et al. [22]) suggested to use the Poisson model (and Fisher's exact test, or a likelihood ratio test as an approximation to it) to test whether a gene is differentially expressed between their two samples. It is important to note that a rejection from such a test only informs us that the difference between the average counts in the two samples is larger than one would expect between technical replicates. Hence, we do not know whether this difference is due to the different tissue type, kidney instead of liver, or whether a difference of the same magnitude could have been found as well if one had compared two samples from different parts of the same liver, or from livers of two individuals.
Figure 1 shows that shot noise is only dominant for very low count values, while already for moderate counts, the effect of the biological variation between samples exceeds the shot noise by orders of magnitude.
Consequently, it is preferable to use a model that allows for overdispersion. While for the Poisson distribution, variance and mean are equal, the negative binomial distribution is a generalization that allow for the variance to be larger. The most advanced of the published methods using this distribution is likely edgeR [8]. DESeq owes its basic idea to edgeR, yet differs in several aspects.
Sharing of information between genes
First, we discovered that the use of total read counts as estimates of sequencing depth, and hence for the adjustment of observed counts between samples (as recommended by Robinson et al. [8] and others) may result in high apparent differences between replicates, and hence in poor power to detect true differences.
DESeq uses the more robust size estimate Equation (5); in fact, edgeR's power increases when it is supplied with those size estimates instead. (Note: While this paper was under review, edgeR was amended to use the method of Oshlack and Robinson [13].)
For small numbers of replicates as often encountered in practice, it is not possible to obtain simultaneously reliable estimates of the variance and mean parameters of the NB distribution. EdgeR addresses this problem by estimating a single common dispersion parameter. In our method, we make use of the possibility to estimate a more flexible, meandependent local regression. The amount of data available in typical experiments is large enough to allow for sufficiently precise local estimation of the dispersion. Over the large dynamic range that is typical for RNASeq, the raw SCV often appears to change noticeably, and taking this into account allows DESeq to avoid bias towards certain areas of the dynamic range in its differentialexpression calls (see Figure 2 and 4).
This flexibility is the most substantial difference between DESeq and edgeR, as simulations show that edgeR and DESeq perform comparably if provided with artificial data with constant SCV (Supplementary Note G in Additional file 1). EdgeR attempts to make up for the rigidity of the singleparameter noise model by allowing for an adjustment of the modelbased variance estimate with the pergene empirical variance. An empirical Bayes procedure, similar to the one originally developed for the limma package [24–26], determines how to combine these two sources of information optimally. However, for typical low replicate numbers, this socalled tagwise dispersion mode seems to have little effect (Figure 4) or even reduces edgeR's power (Supplementary Note F in Additional file 1).
Third, we have suggested a simple and robust way of estimating the raw variance from the data. Robinson and Smyth [11] employed a technique they called quantileadjusted conditional maximum likelihood to find an unbiased estimate for the raw SCV. The quantile adjustment refers to a rankbased procedure that modifies the data such that the data seem to stem from samples of equal library size. In DESeq, differing library sizes are simply addressed by linear scaling (Equations (2) and (3)), suggesting that quantile adjustment is an unnecessary complication. The price we pay for this is that we need to make the approximation that the sum of NB variables in Equation (10) is itself NB distributed. While it seems that neither the quantile adjustment nor our approximation pose reason for concern in practice, DESeq's approach is computationally faster and, perhaps, conceptually simpler.
Fourth, our approach provides useful diagnostics. Plots such as Supplementary Figure S3 in Additional file 2 are helpful to judge the reliability of the tests. In Figure 1b and 7, it is easy to see at which mean value biological variability dominates over shot noise; this information is valuable to decide whether the sequencing depth or the number of biological replicates is the limiting factor for detection power, and so helps in planning experiments. A heatmap as in Figure 5 is useful for data quality control.
Materials and methods
The R package DESeq
We implemented the method as a package for the statistical environment R [27] and distribute it within the Bioconductor project [28]. As input, it expects a table of count data. The data, as well as metadata, such as sample and gene annotation, are managed with the S4 class CountDataSet, which is derived from eSet, Bioconductor's standard data type for tablelike data. The package provides highlevel functions to perform analyses such as shown in Section Application with only a few commands, allowing researchers with little knowledge of R to use it. This is demonstrated with examples in the documentation provided with the package (the package vignette). Furthermore, lowerlevel functions are supplied for advanced users who wish to deviate from the standard work flow. A typical calculation, such as the analyses shown in Section Applications, takes a few minutes of time on a personal computer.
All the analyses presented here have been performed with DESeq. Readers wishing to examine them in detail will find a Sweave document with the commented R code of the analysis code as Additional file 2 and the raw data in Additional file 3.
DESeq is available as a Bioconductor package from the Bioconductor repository [28] and from [36].
Abbreviations
 ChIPSeq:

(highthroughput) sequencing of immunoprecipitated chromatin
 ECDF:

empirical cumulative distribution function
 FDR:

falsediscovery rate
 GLM:

generalized linear model
 RNASeq:

(highthroughput) sequencing of RNA
 SCV:

squared coefficient of variation
 NB:

negativebinomial (distribution)
 VST:

variancestabilizing transformation.
Declarations
Acknowledgements
We are grateful to Paul Bertone for sharing the neural stem cells data ahead of publication, and to Bartek Wilczyński, YaHsin Liu, Nicolas Delhomme and Eileen Furlong likewise for sharing the fly RNASeq data. We thank Nicolas Delhomme and Julien Gagneur for helpful comments on the manuscript. S. An. has been partially funded by the European Union Research and Training Network 'Chromatin Plasticity'.
Authors’ Affiliations
References
 Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008, 320: 13441349. 10.1126/science.1158441.PubMedPubMed CentralView ArticleGoogle Scholar
 Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008, 5: 621628. 10.1038/nmeth.1226.PubMedView ArticleGoogle Scholar
 Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, Thiessen N, Griffith OL, He A, Marra M, Snyder M, Jones S: Genomewide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007, 4: 651657. 10.1038/nmeth1068.PubMedView ArticleGoogle Scholar
 Licatalosi DD, Mele A, Fak JJ, Ule J, Kayikci M, Chi SW, Clark TA, Schweitzer AC, Blume JE, Wang X, Darnell JC, Darnell RB: HITSCLIP yields genomewide insights into brain alternative RNA processing. Nature. 2008, 456: 464469. 10.1038/nature07488.PubMedPubMed CentralView ArticleGoogle Scholar
 Smith AM, Heisler LE, Mellor J, Kaper F, Thompson MJ, Chee M, Roth FP, Giaever G, Nislow C: Quantitative phenotyping via deep barcode sequencing. Genome Res. 2009, 19: 18361842. 10.1101/gr.093955.109.PubMedPubMed CentralView ArticleGoogle Scholar
 Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNAseq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 15091517. 10.1101/gr.079558.108.PubMedPubMed CentralView ArticleGoogle Scholar
 Wang L, Feng Z, Wang X, Wang X, Zhang X: DEGseq: an R package for identifying differentially expressed genes from RNAseq data. Bioinformatics. 2010, 26: 136138. 10.1093/bioinformatics/btp612.PubMedView ArticleGoogle Scholar
 Robinson MD, Smyth GK: Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007, 23 (21): 28812887. 10.1093/bioinformatics/btm453.PubMedView ArticleGoogle Scholar
 Whitaker L: On the Poisson law of small numbers. Biometrika. 1914, 10: 3671. 10.1093/biomet/10.1.36.View ArticleGoogle Scholar
 Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139140. 10.1093/bioinformatics/btp616.PubMedPubMed CentralView ArticleGoogle Scholar
 Robinson MD, Smyth GK: Smallsample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008, 9: 321332. 10.1093/biostatistics/kxm030.PubMedView ArticleGoogle Scholar
 Cameron AC, Trivedi PK: Regression Analysis of Count Data. 1998, Cambridge University PressView ArticleGoogle Scholar
 Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010, 11: R2510.1186/gb2010113r25.PubMedPubMed CentralView ArticleGoogle Scholar
 Loader C: Local Regression and Likelihood. 1999, SpringerGoogle Scholar
 McCullagh P, Nelder JA: Generalized Linear Models. 1989, Chapman & Hall/CRC, 2View ArticleGoogle Scholar
 locfit: Local regression, likelihood and density estimation. [http://cran.rproject.org/web/packages/locfit/]
 Agresti A: Categorical Data Analysis. 2002, Wiley, 2View ArticleGoogle Scholar
 Engström P, Tommei D, Stricker S, Smith A, Pollard S, Bertone P: Transcriptional characterization of glioblastoma stem cell lines using tag sequencing. 2010Google Scholar
 Morrissy AS, Morin RD, Delaney A, Zeng T, McDonald H, Jones S, Zhao Y, Hirst M, Marra MA: Nextgeneration tag sequencing for cancer gene expression profiling. Genome Res. 2009, 19: 18251835. 10.1101/gr.094482.109.PubMedPubMed CentralView ArticleGoogle Scholar
 Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, Hong MY, Karczewski KJ, Huber W, Weissman SM, Gerstein MB, Korbel JO, Snyder M: Variation in transcription factor binding among humans. Science. 2010, 328: 232235. 10.1126/science.1183621.PubMedPubMed CentralView ArticleGoogle Scholar
 Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B. 1995, 57: 289300.Google Scholar
 Bullard J, Purdom E, Hansen K, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinformatics. 2010, 11: 9410.1186/147121051194.PubMedPubMed CentralView ArticleGoogle Scholar
 Bloom JS, Khan Z, Kruglyak L, Singh M, Caudy AA: Measuring differential gene expression by short read sequencing: quantitative comparison to 2channel gene expression microarrays. BMC Genomics. 2009, 10: 22110.1186/1471216410221.PubMedPubMed CentralView ArticleGoogle Scholar
 Smyth GK: Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, R Irizarry WH. 2005, New York: Springer, 397420. full_text.View ArticleGoogle Scholar
 Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3: Article3PubMedGoogle Scholar
 Lönnstedt I, Speed T: Replicated microarray data. Stat Sin. 2002, 12: 3146.Google Scholar
 R: A Language and Environment for Statistical Computing. [http://www.Rproject.org]
 Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R8010.1186/gb2004510r80.PubMedPubMed CentralView ArticleGoogle Scholar
 Bliss CI, Fisher RA: Fitting the negative binomial distribution to biological data. Biometrics. 1953, 9: 176200. 10.2307/3001850.View ArticleGoogle Scholar
 Clark SJ, Perry JN: Estimation of the negative binomial parameter κ by maximum quasilikelihood. Biometrics. 1989, 45: 309316. 10.2307/2532055.View ArticleGoogle Scholar
 Lawless JF: Negative binomial and mixed Poisson regression. Can J Stat. 1987, 15: 209225. 10.2307/3314912.View ArticleGoogle Scholar
 Saha K, Paul S: Biascorrected maximum likelihood estimator of the negative binomial dispersion parameter. Biometrics. 2005, 61: 179285. 10.1111/j.0006341X.2005.030833.x.PubMedView ArticleGoogle Scholar
 Fast and accurate computation of binomial probabilities. (Note: This is a copy of the original paper, which is no longer available online.), [http://projects.scipy.org/scipy/rawattachment/ticket/620/loader2000Fast.pdf]
 Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memoryefficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R2510.1186/gb2009103r25.PubMedPubMed CentralView ArticleGoogle Scholar
 HTSeq: Analysing highthroughput sequencing data with Python. [http://wwwhuber.embl.de/users/anders/HTSeq/]
 DESeq. [http://wwwhuber.embl.de/users/anders/DESeq]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.