 Method
 Open Access
 Published:
voom: precision weights unlock linear model analysis tools for RNAseq read counts
Genome Biology volume 15, Article number: R29 (2014)
Abstract
New normal linear modeling strategies are presented for analyzing read counts from RNAseq experiments. The voom method estimates the meanvariance relationship of the logcounts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline. This opens access for RNAseq analysts to a large body of methodology developed for microarrays. Simulation studies show that voom performs as well or better than countbased RNAseq methods even when the data are generated according to the assumptions of the earlier methods. Two case studies illustrate the use of linear modeling and gene set testing methods.
Background
Gene expression profiling is one of the most commonly used genomic techniques in biological research. For most of the past 16 years or more, DNA microarrays have been the premier technology for genomewide gene expression experiments, and a large body of mature statistical methods and tools has been developed to analyze intensity data from microarrays. This includes methods for differential expression analysis [1–3], random effects [4, 5], gene set enrichment [6], gene set testing [7, 8] and so on. One popular differential expression pipeline is that provided by the limma software package [9]. The limma pipeline includes linear modeling to analyze complex experiments with multiple treatment factors, quantitative weights to account for variations in precision between different observations, and empirical Bayes statistical methods to borrow strength between genes.
Borrowing information between genes is a crucial feature of the genomewide statistical methods, as it allows for genespecific variation while still providing reliable inference with small sample sizes. The normalbased empirical Bayes statistical procedures can adapt to different types of datasets and can provide exact type I error rate control even for experiments with a small number of replicate samples [3].
In the past few years, RNAseq has emerged as a revolutionary new technology for expression profiling [10]. One common approach to summarize RNAseq data is to count the number of sequence reads mapping to each gene or genomic feature of interest [11–14]. RNAseq profiles consist therefore of integer counts, unlike microarrays, which yield intensities that are essentially continuous numerical measurements. A number of early RNAseq publications applied statistical methods developed for microarrays to analyze RNAseq read counts. For example, the limma package has been used to analyze logcounts after normalization by sequencing depth [11, 15–17].
Later statistical publications argued that RNAseq data should be analyzed by statistical methods designed specifically for counts. Much interest has focused on the negative binomial (NB) distribution as a model for read counts, and especially on the problem of estimating biological variability for experiments with small numbers of replicates. One approach is to fit a global value or global trend to the NB dispersions [13, 18, 19], although this has the limitation of not allowing for genespecific variation. A number of empirical Bayes procedures have been proposed to estimate the genewise dispersions [20–22]. Alternatively, Lund et al. [23] proposed that the residual deviances from NB generalized linear models be entered into limma’s empirical Bayes procedure to enable quasilikelihood testing. Other methods based on overdispersed Poisson models have also been proposed [24–26].
Unfortunately, the mathematical theory of count distributions is less tractable than that of the normal distribution, and this tends to limit both the performance and the usefulness of the RNAseq analysis methods. One problem relates to error rate control with small sample sizes. Despite the use of probabilistic distributions, all the statistical methods developed for RNAseq counts rely on approximations of various kinds. Many rely on the statistical tests that are only asymptotically valid or are theoretically accurate only when the dispersion is small. All the differential expression methods currently available based on the NB distribution treat the estimated dispersions as if they were known parameters, without allowing for the uncertainty of estimation, and this leads to statistical tests that are overly liberal in some situations [27, 28]. This is true even of the NB exact test [18], which gives exact type I error rate control when the dispersion is known but which becomes liberal when an imprecise dispersion estimator is inserted for the known value. Quasilikelihood methods [23] account for uncertainty in the dispersion by using an Ftest in place of the usual likelihood ratio test, but this relies on other approximations, in particular that the residual deviances are analogous to residual sums of squares from a normal analysis of variance.
A related issue is the ability to adapt to different types of data with high or low dispersion heterogeneity. None of the empirical Bayes methods based on the NB distribution achieve the same adaptability, robustness or small sample properties as the corresponding methods for microarrays, due to the mathematical intractability of count distributions compared to the normal distribution.
The most serious limitation though is the reduced range of statistical tools associated with count distributions compared to the normal distribution. This is more fundamental than the other problems because it limits the types of analyses that can be done. Much of the statistical methodology that has been developed for microarray data relies on use of the normal distribution. For example, we often find it useful in our own microarray gene expression studies to estimate empirical quality weights to downweight poor quality RNA samples [29], to use random effects to allow for repeated measures on the same experimental units [4, 5] or to conduct gene set tests for expression signatures while allowing for intergene correlations [7, 8]. These techniques broaden the range of experimental designs that can be analyzed or offer improved interpretation for differential expression results in terms of higher level molecular processes. None of these techniques are currently available for RNAseq analysis using count distributions.
For these reasons, the purpose of this article is to revisit the idea of applying normalbased microarraylike statistical methods to RNAseq read counts. An obstacle to applying normalbased statistical methods to read counts is that the counts have markedly unequal variabilities, even after logtransformation. Large counts have much larger standard deviations than small counts. While a logarithmic transformation counteracts this, it overdoes the adjustment somewhat so that large logcounts now have smaller standard deviations than small logcounts. We explore the idea that it is more important to model the meanvariance relationship correctly than it is to specify the exact probabilistic distribution of the counts. There is a body of theory in the statistical literature showing that correct modeling of the meanvariance relationship inherent in a data generating process is the key to designing statistically powerful methods of analysis [30]. Such variance modeling may in fact take precedence over identifying the exact probability law that the data values follow [31–33]. We therefore take the view that it is crucial to understand the way in which the variability of RNAseq read counts depends on the size of the counts. Our work is in the spirit of pseudolikelihoods [32] whereby statistical methods based on the normal distribution are applied after estimating a meanvariance function for the data at hand.
Our approach is to estimate the meanvariance relationship robustly and nonparametrically from the data. We work with logcounts normalized for sequence depth, specifically with logcounts per million (logcpm). The meanvariance is fitted to the genewise standard deviations of the logcpm as a function of average logcount. We explore two ways to incorporate the meanvariance relationship into the differential expression analysis. The first is to modify limma’s empirical Bayes procedure to incorporate a meanvariance trend. The second method incorporates the meanvariance trend into a precision weight for each individual normalized observation. The normalized logcounts and associated precision weights can then be entered into the limma analysis pipeline, or indeed into any statistical pipeline for microarray data that is precision weight aware. We call the first method limmatrend and the second method voom, an acronym for ‘variance modeling at the observational level’. limmatrend applies the meanvariance relationship at the gene level whereas voom applies it at the level of individual observations.
This article compares the performance of the limmabased pipelines to edgeR [20, 34], DESeq [13], baySeq [21], TSPM [25], PoissonSeq [26] and DSS [22], all of which are based on NB or overdispersed Poisson distributions. Simulation studies show that the limma pipelines perform at least as well in terms of power and error rate control as the NB or Poisson methods even when the data is generated according to the probabilistic assumptions of the earlier methods. A key advantage of the limma pipelines is that they provide accurate type I error rate control even when the number of RNAseq samples is small. The NB and Poisson based methods either fail to control the error correctly or are excessively conservative. limmatrend and voom perform almost equally well when the sequencing depths are the same for each RNA sample. When the sequencing depths are different, voom is the clear best performer.
Either voom or limmatrend give RNAseq analysts immediate access to many techniques developed for microarrays that are not otherwise available for RNAseq, including all the quality weighting, random effects and gene set testing techniques mentioned above. This article presents two case studies that demonstrate how voom can handle heterogeneous data and complex experiments as well as facilitating pathway analysis and gene set testing.
Results
Counts per million: a simple interpretable scale for assessing differential expression
We suppose that RNAseq profiles (or libraries) are available for a set of n RNA samples. Each profile records the number of sequence reads from that sample that have been mapped to each one of G genomic features. A genomic feature can be any predefined subset of the transcriptome, for example a transcript, an exon or a gene. For simplicity, we will assume throughout this article that reads have been summarized by gene, so that the RNAseq profiles give the number of reads from each sample that have been mapped to each gene. Typically G is large, in the tens of thousands or more, whereas n can be as low as three. The total number of mapped reads (library size) for each sample might vary from a few hundred thousand to hundreds of millions. This is the same context as assumed by a number of previous articles [13, 18, 20, 21, 34].
The number of reads observed for a given gene is proportional not just to the expression level of the gene but also to its gene transcript length and to the sequencing depth of the library. Dividing each read count by the corresponding library size (in millions) yields counts per million (cpm), a simple measure of read abundance that can be compared across libraries of different sizes. Standardizing further by transcript length (in kilobases) gives rise to reads per kilobase per million (rpkm), a wellaccepted measure of gene expression [35]. In this article we will work with the simpler cpm rather than rpkm, because we are interested in relative changes in expression between conditions rather than absolute expression.
This article treats logcpm as analogous to logintensity values from a microarray experiment, with the difference that logcpm values cannot be treated as having constant variances. Differences in logcpm between samples can be interpreted as logfoldchanges of expression. The counts are augmented by a small positive value (a half of one read) to avoid taking the logarithm of zero. This ensures no missing logcpm values and reduces the variability at low count values.
Logcpms have stabilized variances at high counts
Probability distributions for counts are naturally heteroscedastic, with larger variances for larger counts. It has previously been argued that the meanvariance relationship for RNAseq counts should be approximately quadratic [34]. This leads to the conclusion that the coefficient of variation (CV) of RNAseq counts should be a decreasing function of count size for small to moderate counts but for larger counts should asymptote to a value that depends on biological variability. Specifically, the squared CV of the counts should be roughly
where λ is the expected size of the count and ϕ is a measure of biological variation [34]. The first term arises from the technical variability associated with sequencing, and gradually decreases with expected count size, while biological variation remains roughly constant. For large counts, the CV is determined mainly by biological variation.
A simple linearization calculation suggests that the standard deviation of the logcpm should be approximately equal to the CV of the counts (see Materials and methods). Examination of a wide range of real datasets confirms these expectations. For studies where the replicates are entirely technical in nature, the standard deviation of the logcpm decreases steadily as a function of the mean (Figure 1a). For studies where the replicates are genetically identical mice, the standard deviation asymptotes at a moderate level corresponding to a biological CV of about 10% (Figure 1b). Studies where the replicates are unrelated human individuals show greater biological variation. For these studies, the standard deviation asymptotes early and at a relatively high level (Figure 1d).
We conclude that logcpm values generally show a smoothly decreasing meanvariance trend with count size, and that the logcpm transformation roughly detrends the variance of the RNAseq counts as a function of count size for genes with larger counts.
Using logcpm in a limma pipeline
A simple approach to analyzing RNAseq data would be to input the logcpm values into a wellestablished microarray analysis pipeline such as that provided by the limma software package [3, 9]. This would be expected to behave well if the counts were all reasonably large, but it ignores the meanvariance trend for lower counts. The microarray pipeline should behave better if modified to include a meanvariance trend as part of the variance modeling. We have therefore modified the empirical Bayes procedure of the limma package so that the genewise variances are squeezed towards a global meanvariance trend curve instead of towards a constant pooled variance. This is similar in principle to the procedure proposed by Sartor et al. [36] for microarray data, except that we model the trend using a regression spline and our implementation allows for the possibility of missing values or differing residual degrees of freedom between genes. We call this strategy limmatrend, whereby the logcpm values are analyzed as for microarray data but with a trended prior variance. For comparison, the more naive approach without the meanvariance trend will be called limmanotrend.
voom: variance modeling at the observationlevel
The limmatrend pipeline models the variance at the gene level. However, in RNAseq applications, the count sizes may vary considerably from sample to sample for the same gene. Different samples may be sequenced to different depths, so different count sizes may be quite different even if the cpm values are the same. For this reason, we wish to model the meanvariance trend of the logcpm values at the individual observation level, instead of applying a genelevel variability estimate to all observations from the same gene.
Our strategy is to estimate nonparametrically the meanvariance trend of the logged read counts and to use this meanvariance relationship to predict the variance of each logcpm value. The predicted variance is then encapsulated as an inverse weight for the logcpm value. When the weights are incorporated into a linear modeling procedure, the meanvariance relationship in the logcpm values is effectively eliminated.
A technical difficulty is that we want to predict the variances of individual observations although there is, by definition, no replication at the observational level from which variances could be estimated. We work around this inconvenience by estimating the meanvariance trend at the gene level, then interpolating this trend to predict the variances of individual observations (Figure 2).
The algorithm proceeds as follows. First, genewise linear models are fitted to the normalized logcpm values, taking into account the experimental design, treatment conditions, replicates and so on. This generates a residual standard deviation for each gene (Figure 2a). A robust trend is then fitted to the residual standard deviations as a function of the average logcount for each gene (Figure 2b).
Also available from the linear models is a fitted value for each logcpm observation. Taking the library sizes into account, the fitted logcpm for each observation is converted into a predicted count. The standard deviation trend is then interpolated to predict the standard deviation of each individual observation based on its predicted count size (Figure 2c). Finally, the inverse squared predicted standard deviation for each observation becomes the weight for that observation.
The logcpm values and associated weights are then input into limma’s standard differential expression pipeline. Most limma functions are designed to accept quantitative weights, providing the ability to perform microarraylike analyses while taking account of the meanvariance relationship of the logcpm values at the observation level.
voom and limmatrend control the type I error rate correctly
We have found that voom and limmatrend, especially voom, perform well and produce P values that control error rates correctly over a wide range of simulation scenarios. For illustration, we present results from simulations in which read counts were generated under the same NB model as assumed by a number of existing RNAseq analysis methods. These simulations should represent the ideal for the NBbased methods. If the normalbased methods can give performance comparable to or better than countbased methods in these simulations, then this is strong evidence that they will be competitive across a wide range of data types.
Six RNAseq count libraries were simulated with counts for 10,000 genes. The first three libraries were treated as group 1 and the others as group 2. The distribution of cpm values for each library was simulated to match the distribution that we observed for a real RNAseq dataset from our own practice. The NB dispersion ϕ was set to decrease on average with expected count size, asymptoting to 0.2 squared for large counts. This degree of biological variation is representative of what we observe for real RNAseq data, being larger than we typically observe between genetically identical laboratory mice but less than we typically see between unrelated human subjects (Figure 1). An individual dispersion ϕ was generated for each gene around the trend according to an inverse chisquare distribution with 40 degrees of freedom. The voom meanvariance trend for one such simulated dataset is shown in Figure 1c. It can be seen from Figure 1 that the simulated dataset is intermediate between the mouse data (Figure 1b) and the human data (Figure 1d) both in terms of the absolute size of the dispersions and in terms of heterogeneity of the dispersions between genes.
We found that variation in sequencing depth between libraries had a noticeable impact on some RNAseq analysis methods. Hence all the simulations were repeated under two library size scenarios, one with the same sequencing depth for all six libraries and one with substantial variation in sequencing depth. In the equal size scenario, all libraries were simulated to contain 11 million reads. In the unequal size scenario, the oddnumbered libraries were simulated to have a sequence depth of 20 million reads while the evennumbered libraries had a sequence depth of 2 million reads. Hence the same total number of reads was simulated in this scenario but distributed unevenly between the libraries.
In the first set of simulations, we examined the ability of voom and limmatrend to control the type I error rate correctly in the absence of any genuine differential expression between the groups. When there are no truly differentially expressed genes, the genewise P values should follow an approximate uniform distribution. If the type I error rate is controlled correctly, then the expected proportion of P values below any cutoff should be less than or equal to the cutoff value. A number of popular RNAseq analysis methods based on the NB or Poisson distributions were included for comparison. Figure 3 shows results for a P value cutoff of 0.01. Results for other cutoffs are qualitatively similar. None of the NB or Poissonbased methods were found to control the type I error rate very accurately. When the library sizes are equal, the NB and Poisson methods were overly liberal, except for DESeq which is very conservative. When the library sizes are unequal, DSS and DESeq became extremely conservative. By contrast, all the normalbased methods were slightly conservative. voom produces results very close to the nominal type I error rate for both library size scenarios. limmatrend is similar to voom when the library sizes are equal but somewhat conservative when the library sizes are unequal.
baySeq was not included in the type I error rate comparison because it does not return P values. However, the results presented in the next section show that it is relatively conservative in terms of the false discovery rate (FDR) (Figure 4).
To check voom’s conservativeness on real data, we used a set of four replicate libraries from the SEQC Project [37]. All four libraries were Illumina HiSeq 2000 RNAseq profiles of samples of Ambion’s Human Brain Reference RNA (HBRR) [38]. We split the four libraries into two groups in all possible ways, and tested for differential expression between the two groups for each partition. voom returned no differentially expressed (DE) genes at 5% FDR for six out of the seven possible partitions, indicating good error rate control. The voom meanvariance trend for the SEQC data, using all the libraries rather than the HBRR samples only, is shown in Figure 1a.
voom has the best power of methods that control the type I error rate
Next we examined the power to detect true differential expression. For the following simulations, 100 randomly selected genes were twofold upregulated in the first group and another 100 were twofold upregulated in the second group. This represents a typical scenario for a functional genomics experiment in which the differential expression effects are large enough to be biologically important but nevertheless sufficiently subtle as to challenge many analysis methods. Figure 4 shows the number of true and false discoveries made by various analysis methods at significance cutoff FDR <0.1. When the library sizes are equal, voom and limmatrend have the next best power after edgeR and PoissonSeq. However, both edgeR and PoissonSeq give empirical FDRs greater than 0.1, confirming the results of the previous section that these methods are somewhat liberal. limmatrend gives an empirical FDR slightly greater than voom but still less than 0.1. With unequal library sizes, voom has the best power except for edgeR while still maintaining a low FDR. TSPM declares by far the most DE genes, but these are mostly false discoveries. DSS also gives a worryingly high rate of false discoveries when the library sizes are unequal. Figures 3 and 4 together show that voom has the best power of those methods that correctly control the type I and FDR error rates.
voom has the lowest false discovery rate
Next we compared methods from a gene ranking point of view, comparing methods in terms of the number of false discoveries for any given number of genes selected as DE. Methods that perform well will rank the truly DE genes in the simulation ahead of nonDE genes. Genes were ranked by posterior likelihood for baySeq and by P value for the other methods. The results show that voom has the lowest FDR at any cutoff (Figure 5). When the library sizes are equal, limmatrend and PoissonSeq are very close competitors (Figure 5a). When the library sizes are unequal, limmatrend and edgeR are the closest competitors (Figure 5b).
Next we compared FDRs using spikein control transcripts from the SEQC project [39]. The data consists of eight RNAseq libraries, in two groups of four. A total of 92 artificial control transcripts were spikedin at different concentrations in such a way that three quarters of the transcripts were truly DE and the remaining quarter were not. To make the spikeins more like a realistic dataset, we replicated the counts for each of the 23 nonDE transcripts three times. That is, we treated each nonDE transcript as three different transcripts. This resulted in a dataset of 138 transcripts with half DE and half nonDE. Figure 6 is analogous to Figure 5 but using the spikein data instead of simulated data. voom again achieved the lowest FDR, with edgeR and the other limma methods again being the closest competitors (Figure 6).
voom and limmatrend are faster than specialist RNAseq methods
The different statistical methods compared varied considerably in computational time required, with DESeq, TSPM and baySeq being slow enough to limit the number of simulations that were done. voom is easily the fastest of the methods compared, with edgeRclassic next fastest (Figure 7).
RNAseq profiles of male and female Nigerian individuals
So far we have demonstrated the performance of voom on RNAseq datasets with small numbers of replicate libraries. To demonstrate the performance of voom on a heterogeneous dataset with a relatively large number of replicates and a high level of biological variability, we compared males to females using RNAseq profiles of lymphoblastoid cell lines from 29 male and 40 female unrelated Nigerian individuals [40]. Summarized read counts and gene annotation are provided by the Bioconductor tweeDEseqCountData package [41]. Figure 1d shows the voom meanvariance trend of this dataset.
voom yielded 16 genes upregulated in males and 43 upregulated in females at 5% FDR. As expected, most of the top differentially expressed genes belonged to the X or Y sex chromosomes (Table 1). The top gene is XIST, which is a key player in X inactivation and is known to be expressed at meaningful levels only in females.
We examined 12 particular genes that are known to belong to the malespecific region of chromosome Y [42, 43]. A ROAST gene set test confirmed that these genes collectively are significantly upregulated in males (P=0.0001). A CAMERA gene set test was even more convincing, confirming that these genes are significantly more upregulated in males than are other genes in the genome (P=2×10^{28}).
We also examined 46 X chromosome genes that have been reported to escape X inactivation [43, 44]. These genes were significantly upregulated in females (ROAST P=0.0001, CAMERA P=10^{10}). The logfoldchanges for the X and Y chromosome genes involved in the gene set tests are highlighted on an MA plot (Figure 8).
Note that these gene set testing approaches are not available for any of the countbased approaches to differential expression. If a countbased method had been used to assess differential expression, we could still have examined whether sexlinked genes were highly ranked among the differentially expressed genes, but we could not have undertaken any formal statistical test for enrichment of this signature while accounting for intergene correlation. On the other hand, the voom expression values and weights are suitable for input into the ROAST and CAMERA procedures without any further processing.
Development stages of Drosophila melanogaster
Like edgeRglm, but unlike most other analysis tools, voom and limmatrend offer fullfeatured linear modeling for RNAseq data, meaning that they can analyze arbitrary complex experiments. The possibilities of linear modeling are so rich that it is impossible to select a representative example. voom and limma could be used to analyze any genelevel RNAseq differential expression experiment, including those with multiple experimental factors [34]. Here we give a novel analysis illustrating the use of quadratic regression to analyze a timecourse study.
RNAseq was used to explore the developmental transcriptome of Drosophila melanogaster[45]. RNAseq libraries were formed from wholeanimal samples to represent a large number of distinct developmental stages. In particular, samples were collected from embryonic animals at equispaced development stages from 2 hours to 24 hours in 2hour intervals. Here we analyze the 12 RNAseq libraries from these embryonic stages. We sought to identify those genes that are characteristic of each embryonic stage. In particular we wished to identify, for each embryonic stage, those genes that achieve their peak expression level during that stage.
As all the samples are from distinct stages, there are no replicate libraries in this study. To estimate variances we utilized the fact that gene expression should for most genes vary smoothly over time. A multidimensional scaling plot of logcpm values shows the gradual change in gene expression during embryonic development, with each stage intermediate in expression profile between the stages before and after (Figure 9). We used genewise linear models to fit a quadratic trend with time to the logcpm values for each gene. These quadratic trends will not match all the intricacies of gene expression changes over time but are sufficient to model the major trends. The voom meanvariance trend for this data is shown in Figure 1e.
Out of 14,869 genes that were expressed during embryonic development, 8,366 showed a statistically significant trend at 5% FDR using empirical Bayes Ftests. For each differentially expressed gene, we identified the embryonic stage at which the fitted quadratic trend achieved its maximum value. This allowed us to associate each significant gene with a particular development stage (Figure 10). Most genes peaked at the first or last stage (Figure 10), indicating smoothly decreasing or increasing trends over time (Figure 11, panels 1 and 12). Genes peaking at the first embryonic stage tended to be associated with the cell cycle. Genes peaking at the last stage tended to be associated with precursor metabolites and energy, the oxidationreduction process and with metabolic pathways.
Genes peaking at intermediate stages have expression trends with an inverseU shape (Figure 11, panels 2–11). There was a substantial set of genes with peak activity between 12–16 hours of embryonic development (Figure 10), suggesting some important developmental change occurring during this period requiring the action of specialpurpose genes. Indeed, gene ontology analysis of the genes associated with this period showed that anatomical structure morphogenesis was the most significantly enriched biological process. Other leading terms were organ morphogenesis and neuron differentiation.
This analysis demonstrates a simple but effective means of identifying genes that have a particular role at each developmental stage.
Discussion
This article follows the common practice of examining differential expression on a genewise basis. Our preferred practice is to count the total number of reads overlapping annotated exons for each genes. While this approach does not allow for the possibility that different isoforms of the same gene may be differentially expressed in different directions, it does provide a statistically robust genelevel analysis even when the sequencing depths are quite modest. The relevance of genelevel analyses is also supported by recent surveys of transcription, which have shown that each gene tends to have a dominant isoform that accounts for far more of the total expression for that gene than any of the remaining isoforms [46, 47]. The voom analysis can also be conducted at the exon level instead of at the gene level as an aid to detecting alternative splicing between the treatment groups.
In this article, voom has been applied to logcpm values. voom can work, however, just as easily with logged rpkm values in place of logcpm, because the precision weights are the same for both measures. If the genomic length of each gene is known, then the logcpm values output by voom can be converted to logrpkm by subtracting the log2 gene length in kilobases. The downstream analysis is unchanged and will yield identical results in terms of differentially expressed genes and estimated foldchanges.
This article has shown that a normalbased analysis of RNAseq read count data performs surprisingly well relative to methods that use specialpurpose count distributions. The motivation for examining normalbased methods was to open up access to a range of microarraylike analysis tools based on the normal distribution. From this point of view, the normalbased methods only need to perform comparably to the countbased methods in terms of power and FDR control in order to be a success. Our comparisons suggest not only that this is so, but that the normalbased methods actually have a performance advantage. We found voom to be the best performer across our simulations and comparisons, and even the simpler limmatrend method performed equal or better than the countbased methods. voom and limmatrend perform almost equally when the library sizes are equal, but voom has the advantage when the library sizes are unequal. The best performing countbased methods were edgeR and PoissonSeq, although neither of those methods controlled the type I error rate at the nominal level, both being somewhat liberal. The performance advantage of voom over many of the countbased methods was quite substantial in our simulations, despite the simulations being conducted under the same NB distributional assumptions as made by a number of existing methods. Other simulation scenarios would tend to increase voom’s advantage. For example, it would be at least as scientifically reasonable to assume that the true expression levels for each gene follow a lognormal distribution between replicates instead of a gamma distribution, and such an assumption would tend to improve the performance of voom relative to edgeR, DESeq, baySeq and DSS. In general, voom makes fewer distributional assumptions than do competing methods and can therefore be expected to perform robustly across a range of scenarios. This study presented simulations with equal library sizes between replicates, and also explored the sensitivity of the methods to unequal library sizes. In our experience markedly unequal library sizes can arise in real RNAseq experiments for a variety of reasons. One scenario is when an experiment is conducted in stages and samples sequenced at a later time have a much higher sequencing depth. Other possible scenarios occur when technical replicates are combined for a subset of samples or when DNA samples are multiplexed onto a sequencing lane in unequal quantities. Some of the NBbased analysis methods become very conservative or showed very poor FDR control when the library sizes were unequal. In contrast, voom shows consistent performance in all scenarios.
The worst performer in our simulation was TSPM, presumably because we have simulated from NB distributions, which have quadratic meanvariance relationships, whereas TSPM assumes a linear meanvariance relationship [25]. The second worst performer was the ordinary ttest. This shows that traditional statistical methods cannot be reliably applied to genomic data without borrowing strength between genes. The third worst performer was limmanotrend, showing that the meanvariance trend in the logcpm values cannot be ignored.
To examine sensitivity of the results to the shape of the dispersion distribution, we repeated all the simulations using a lognormal distribution for the genewise dispersions instead of an inverse chisquare distribution. The two distributions were chosen to have the same mean and variance on the logscale. The results were virtually unchanged from those shown in Figures 3, 4 and 5, showing that the shape of the dispersion distribution is not a major determination of performance. This agrees with a similar conclusion in Wu et al. [22].
It may seem surprising at first that voom should perform so well even though it ignores the discrete integer nature of the counts. We think there are several possible reasons for this. First, the parametric advantages of the Poisson or NB distributions are mitigated by the fact that the observed meanvariance relationship of RNAseq data does not perfectly match the theoretical meanvariance relationships inherent in these distributions. While the quadratic meanvariance relationship of the NB distribution captures most of the meanvariance trend, the NB dispersion still shows a nonignorable trend with gene abundance [13, 19, 34]. This means that the meanvariance relationship still has to be estimated nonparametrically, at least in part.
Second, voom is more precise than previous methods in terms of its treatment of the meanvariance trend. While several previous methods fit a semiparametric trend to the variances or to the NB dispersions [13, 19, 23, 34], the trend has always been used to estimate genelevel model parameters. This ignores the fact that different counts for the same gene may vary substantially in size, meaning that the trend should be applied differently to different observations. This consideration becomes more critical when different RNA samples are sequenced to different depths.
Third, the use of normal models gives voom access to tractable empirical Bayes distribution theory [3], facilitating reliable estimation of the Bayesian hyperparameters and exact small sample distributions for the test statistics. Amongst other things this facilitates accurate estimate of the prior degrees of freedom determining the optimal amount of squeezing to be applied to the variances.
Fourth, the use of normal distribution approximations in conjunction with variance modeling is partly supported by generalized linear model theory. Rao’s score test [48] for a covariate in a generalized linear model is essentially equivalent to the normal theory test statistic, provided that the meanvariance function is correctly estimated and incorporated into appropriate precision weights [49]. Score tests have similar performance to likelihood ratio tests when the null hypothesis is true or when the changes being detected are relatively small.
Some of the countbased methods have been criticized as being sensitive to outlier counts [28]. The voom and limmatrend methods inherit good robustness properties from the normalbased procedures in limma [28]. If necessary, they can be made extremely robust to outliers and hypervariable genes using the robust empirical Bayes options of the limma package [50].
In addition to performance results, voom has a number of qualitative advantages over the countbased methods. It is fast and convenient. It allows RNAseq and microarray data to be analyzed in closely comparable ways, which may be an attraction for analysts comparing results from the two technologies. It gives access to a wealth of statistical methods developed for microarrays, including for example the gene set testing methods demonstrated on the Nigerian dataset.
Conclusions
voom performs as well or better than existing RNAseq methods, especially when the library sizes are unequal. It is moreover faster and more convenient, and converts RNAseq data into a form that can be analyzed using similar tools as for microarrays.
Materials and methods
Logcounts per million
We assume that an experiment has been conducted to generate a set of n RNA samples. Each RNA sample has been sequenced, and the sequence reads have been summarized by recording the number mapping to each gene. The RNAseq data consist therefore of a matrix of read counts r_{ g i }, for RNA samples i=1 to n, and genes g=1 to G. Write R_{ i } for the total number of mapped reads for sample i:
We define the logcounts per million (logcpm) value for each count as:
The counts are offset away from zero by 0.5 to avoid taking the log of zero, and to reduce the variability of logcpm for low expression genes. The library size is offset by 1 to ensure that (r_{ gi }+0.5)/(R_{ i }+1) is strictly less than 1 as well as strictly greater than zero.
Delta rule for logcpm
Write λ=E(r) for the expected value of a read count given the experimental conditions, and suppose that:
where ϕ is a dispersion parameter. If r is large, then the logcpm value of the observation is:
where R is the library size. Note that the analysis is conditional on R, so R is treated as a constant. It follows that var(y)≈var(log2(r). If λ also is large, then:
by Taylor’s theorem [51], so:
Linear models
This article develops differential expression methods for RNAseq experiments of arbitrary complexity, for example experiments with multiple treatment factors, batch effects or numerical covariates. As has been done previously [3, 7, 8, 34], we use linear models to describe how the treatment factors are assigned to the different RNA samples. We assume that:
where x_{ i } is a vector of covariates and β_{ g } is a vector of unknown coefficients representing log2foldchanges between experimental conditions. In matrix terms:
where y_{ g } is the vector of logcpm values for gene g and X is the design matrix with the x_{ i } as rows. Interest centers on testing whether one or more of the β_{ gj } are equal to zero,
voom variance modeling
The above linear model is fitted, by ordinary least squares, to the logcpm values y_{ gi } for each gene. This yields regression coefficient estimates {\widehat{\beta}}_{g}, fitted values {\widehat{\mu}}_{\mathit{\text{gi}}}={x}_{i}^{T}{\widehat{\beta}}_{g} and residual standard deviations s_{ g }.
Also computed is the average logcpm {\stackrel{\u0304}{y}}_{g} for each gene. The average logcpm is converted to an average logcount value by:
where \stackrel{~}{R} is the geometric mean of the library sizes plus one.
To obtain a smooth meanvariance trend, a LOWESS curve is fitted to squareroot standard deviations {s}_{g}^{1/2} as a function of mean logcounts \stackrel{~}{R} (Figure 2a,b). Squareroot standard deviations are used because they are roughly symmetrically distributed. The LOWESS curve [52] is statistically robust [53] and provides a trend line through the majority of the standard deviations. The LOWESS curve is used to define a piecewise linear function lo() by interpolating the curve between ordered values of \stackrel{~}{R}.
Next the fitted logcpm values {\widehat{\mu}}_{\mathit{\text{gi}}} are converted to fitted counts by:
The function value \text{lo}\left({\widehat{\lambda}}_{\mathit{\text{gi}}}\right) is then the predicted squareroot standard deviation of y_{ gi }.
Finally, the voom precision weights are the inverse variances {w}_{\mathit{\text{gi}}}=\text{lo}{\left({\widehat{\lambda}}_{\mathit{\text{gi}}}\right)}^{4} (Figure 2c). The logcpm values y_{ gi } and associated weights w_{ gj } are then input into limma’s standard linear modeling and empirical Bayes differential expression analysis pipeline.
Gene set testing methods
ROAST [7] and CAMERA [8] are gene set testing procedures that assess changes in the overall expression signature defined by a set of genes. ROAST [7] is a selfcontained test that assesses differential expression of the gene set without regard to genes not in the set. CAMERA [8] is a competitive test that assesses differential expression of the gene set relative to all other genes on the array. Both procedures offer considerable flexibility as they have the ability to test the association of a genomic pathway or gene set signature with quite general treatment comparisons or contrasts defined in the context of a microarray linear model. We have adapted both methods to make use of quantitative weights as output by voom. The revised methods are implemented in the functions roast() and camera() of the limma software package.
Normalization
The logcpm values are by definition normalized for sequencing depth. Other normalization steps can optionally be done. The library sizes R_{ i } can be scale normalized to adjust for compositional differences between the RNAseq libraries [54]. This produces normalized library sizes {R}_{i}^{\ast} that can be used in place of R_{ i } in the voom pipeline. Alternatively, betweenarray normalization methods developed for single channel microarray data, such as quantile or cyclic LOESS, can be are applied to the logcpm values.
Simulations
The simulations were designed to generate data with characteristics similar to real data that we analyze in our own practice. First a set of baseline expression values was generated representing the relative proportion of counts expected to arise from each gene. These proportions were translated into expected count sizes by multiplying by library size, and then multiplied by true foldchanges as appropriate. Counts were then generated following a NB distribution with the specified mean and dispersion for each observation.
The distribution of baseline values was chosen to match that from RNAseq experiments conducted at our institution. Specifically we used the goodTuringProportions function of the edgeR package [12], which implements the GoodTuring algorithm [55], to predict the true proportion of total RNA attributable to each gene. We ran this function on a number of different libraries, pooled the predicted proportions and formed a smoothed distribution function. The baseline proportions for the simulations were then generated to follow this distribution.
The NB dispersions were generated as follows. The trend in the dispersions was set to be ψ_{ gi } with:
where λ_{ gi } is the expected count size. A modest amount of genewise biological variation was generated from an inverse chisquare distribution with 40 degrees of freedom. The individual dispersions were set to be ϕ_{ gi }=ψ_{ gi }δ_{ g } where 40/{\delta}_{g}\sim {\chi}_{40}^{2}.
In an alternative simulation, to investigate sensitivity to the distribution of genewise dispersions, the δ_{ g } were simulated as lognormal with mean 0 and standard deviation 0.25 on the logscale. This produces a distribution with a similar CV as for the inverse chisquare simulation.
For each simulated dataset, genes with less than ten reads across all samples were filtered from the analysis. PoissonSeq resets the seed of the random number generator in R, so it was necessary to save and restore the state of the random number generator before and after each call of the main PoissonSeq function.
Complete runnable code that reproduces all the simulations is provided as Additional file 1. See also the voom website [56].
SEQC data
The SEQC project, also known as MAQCIII, aims to provide a comprehensive study of nextgeneration sequencing technologies [37]. We analyze here a pilot SEQC dataset consisting of 16 RNAseq libraries in four groups. The full SEQC data including the 16 libraries analyzed here will become available as GEO series [GEO:GSE47792] when the main SEQC article is published in 2014. In the meantime, the aligned and summarized read counts for the pilot libraries needed to repeat the analyses in this article are available from the voom webpage [56].
The groups are labeled A–D and are closely analogous to the similarly labeled RNA samples used in the earlier microarray quality control study [57]. Libraries in group A are profiles of Stratagene’s Universal Human Reference RNA (UHRR) with the addition of RNA from Ambion’s ERCC ExFold RNA spikein mix 1 (Mix 1). Libraries in group B are profiles of Ambion’s Human Brain Reference RNA (HBRR) with added RNA from Ambion’s ERCC ExFold RNA spikein mix 2 (Mix 2). RNA samples in groups C and D are mixtures of A and B in the proportions 75:25 and 25:75, respectively. An Illumina HiSeq 2000 was used to create a FastQ file of pairedend sequence reads for each sample. The library size for each sample varied from 5.4 to 8.0 million read pairs. Fragments were mapped to the National Center for Biotechnology Information’s Build 37.2 of the human genome using the Subread aligner [58]. Fragment counts were summarized by Entrez Gene ID using the featureCounts function [59] of version 1.8.2 of the Bioconductor package Rsubread [60]. Fragments with both end reads mapped successfully contributed one count if the fragment overlapped any annotated exon for that gene. Fragments for which only one read mapped successfully contributed half a count if that read overlapped an exon. The summarized read count data is available from the voom webpage [56].
The voom meanvariance trend shown in Figure 1a was obtained from all 16 libraries, treated as four groups. Genes were filtered out if they failed to achieve cpm >1 in at least four libraries, and the remaining logcpm values were quantile normalized between libraries [61].
The comparison between technical replicates to check the type I error rate control used only the four group B libraries. Genes were filtered out if they failed to achieve a cpm >1 in at least two libraries and the logcpm values for the 16,745 remaining genes were quantile normalized. Samples were separated into all possible twoversustwo and threeversusone combinations and a limma analysis using voom weights was carried out for each partition.
The false discovery rate analysis was conducted on the spikein transcripts only. ERCC Mixes 1 and 2 contain 92 transcripts spiked in at different concentrations. For this analysis, fragments were mapped to the known sequences of the spikedin transcripts using Subread. The experiment is designed so that 23 transcripts have the same concentration in Mix 1 and Mix 2. The remaining transcripts were spikedin in such a way that 23 transcripts are fourfold more abundant in Mix 1, 23 are 1.5 higher in Mix 2 and 23 are twofold higher in Mix 2. A majority of the spikein transcripts data are DE. We replicated the counts for each of the 23 nonDE transcripts three times, so that each nonDE transcript was treated as three different transcripts. This resulted in a dataset of 138 transcripts, half DE and half nonDE. Our analysis used read counts for the spikein transcripts only. TMMscale normalization [54] was used for all the analysis methods, except for DESeq and PoissonSeq, which have their own builtin normalization methods. No transcripts were filtered, except by PoissonSeq as its standard analysis includes the removal of probes with low counts. The genes that were filtered out by PoissonSeq were reintroduced to the end of the gene ranking, ordered from largest mean count to lowest mean count.
Lymphoblastoid cell lines from Nigerian individuals
As part of the International HapMap Project, RNA samples were obtained from lymphoblastoid cell lines derived from 69 unrelated Nigerian individuals including 29 males and 40 females [40]. Sequencing was performed using an Illumina Genome Analyzer II. Read counts, summarized by Ensembl gene, and transcript annotations were obtained from version 1.0.9 of the tweeDEseqCountData Bioconductor package [43], specifically from the data objects pickrell1, annotEnsembl63 and genderGenes. Genes were filtered if they failed to achieve a cpm value of 1 in at least 20 libraries. Library sizes were scalenormalized by the TMM method [54] using edgeR software [12] prior to the voom analysis.
Development stages of Drosophila melanogaster
RNAseq was used to explore the developmental transcriptome of D. melanogaster[45]. Mapped read counts are available from the ReCount project [62]. Specifically the pooled version of the modencodefly dataset from the ReCount website [63] provides read counts summarized by Ensembl 61 gene IDs for 30 wholeanimal biological samples. We discarded the larval, pupal and adult stages and kept only the 12 embryonic samples. Genes were retained in the analysis if they achieved cpm >1 for any embryonic stage. Effective library sizes were estimated by TMM scalenormalization [54] using edgeR software [12] prior to the voom analysis.
Gene ontology analysis used the GOstats software package [64] and version 2.9.0 of the org.Dm.eg.db annotation package [65]. All GO terms mentioned in the Results section had Fisher’s exact test P values less than 10^{10}.
C57BL/6J and DBA/2J inbred mouse strains
An RNAseq experiment was carried out to detect differential striatal gene expression between the C57BL/6J (B6) and DBA/2J (D2) inbred mouse strains [66]. Profiles were made for 10 B6 and 11 D2 mice. Mapped read counts summarized by Ensembl 61 gene IDs were downloaded as the bottomly dataset from the ReCount website [63]. Genes were filtered out if they failed to achieve cpm >1 in at least four libraries and the remaining logcpm values were quantile normalized. The limmavoom analysis compared the two strains and included a batch effect correction for the Illumina flow cell in which each sample was sequenced. The voom meanvariance trend is shown in Figure 1b.
Software
The results presented in this article were obtained using R version 3.0.0 and the software packages limma 3.16.2, edgeR 3.2.3, baySeq 1.14.1, DESeq 1.12.0, DSS 1.4.0, PoissonSeq 1.1.2 and tweeDEseqCountData 1.0.8. All of these packages are part of the Bioconductor project [67, 68], except for PoissonSeq, which is part of the Comprehensive R Archive Network [69]. The TSPM function, dated February 2011, was downloaded in March 2013 from the author’s webpage [70].
The voom methodology proposed in the article is implemented in the voom function of the limma package. The limmatrend method was implemented by inputting the logcpm values from voom into limma’s standard pipeline, with trend=TRUE for the eBayes function. Hence the limmatrend pipeline was the same as that for voom except that weights were not used in the linear modeling step and the trend option was turned on for the empirical Bayes step. The limma package can be installed from the Bioconductor project repository [71].
All the countbased packages were used with the default differential expression pipelines as recommended in the software for each package. For edgeR 3.2.3 the default prior degrees of freedom for squeezing the genewise dispersions is 10. Note that this is a change from versions 3.0.X and earlier for which the default had been 20. For DSS the Wald test was used as recommended in the documentation. The DESeq defaults have changed considerably since the original publication. We used the DESeq function estimateDispersions with sharingMode=~maximum~ and fitType=~local~ and conducted tests using nbinomTest.
The different countbased packages implement different methods of compositional normalization [54]. For our simulations, there are no compositional differences between the libraries so there should be no need to estimate compositional normalization factors. For this reason we did not use the calcNormFactors function with edgeR or estimateSizeFactors with DESeq or estNormFactors with DSS. This should tend to improve the performance of the packages and to make them more comparable, as any differences between the packages can be attributed to the statistical procedures rather than to differences between the normalization strategies.
Abbreviations
 CV:

coefficient of variation
 DE:

differentially expressed
 FDR:

false discovery rate
 HBRR:

Ambion’s Human Brain Reference RNA
 logcpm:

logcounts per million
 LOWESS:

locally weighted regression
 NB:

negative binomial
 rpkm:

reads per kilobase per million
 UHRR:

Stratagene’s Universal Human Reference RNA.
References
Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci. 2001, 98: 51165121. 10.1073/pnas.091062498.
Wright GW, Simon RM: A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics. 2003, 19: 24482455. 10.1093/bioinformatics/btg345.
Smyth G: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3: Article 3
Cui X, Hwang JG, Qiu J, Blades NJ, Churchill GA: Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005, 6: 5975. 10.1093/biostatistics/kxh018.
Smyth G, Michaud J, Scott H: Use of withinarray replicate spots for assessing differential expression in microarray experiments. Bioinformatics. 2005, 21: 20672075. 10.1093/bioinformatics/bti270.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proc Natl Acad Sci USA. 2005, 102: 1554515550. 10.1073/pnas.0506580102.
Wu D, Lim E, Vaillant F, AsselinLabat M, Visvader J, Smyth G: ROAST rotation gene set tests for complex microarray experiments. Bioinformatics. 2010, 26: 21762182. 10.1093/bioinformatics/btq401.
Wu D, Smyth G: Camera: a competitive gene set test accounting for intergene correlation. Nucleic Acids Res. 2012, 40: e133e133. 10.1093/nar/gks461.
Smyth G: Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, New York: Springer, 397420.
Wang Z, Gerstein M, Snyder M: RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 5763. 10.1038/nrg2484.
Cloonan N, Forrest ARR, Kolle G, Gardiner BBA, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, Mckernan KJ, Grimmond SM: Stem cell transcriptome profiling via massivescale mRNA sequencing. Nature Methods. 2008, 5: 613619. 10.1038/nmeth.1223.
Robinson M, McCarthy D, Smyth G: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139140. 10.1093/bioinformatics/btp616.
Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11: R10610.1186/gb20101110r106.
Oshlack A, Robinson MD, Young MD: From RNAseq reads to differential expression results. Genome Biol. 2010, 11: 22010.1186/gb20101112220.
Perkins TT, Kingsley RA, Fookes MC, Gardner PP, James KD, Yu L, Assefa SA, He M, Croucher NJ, Pickard DJ, Maskell DJ, Parkhill J, Choudhary J, Thomson NR, Dougan G: A strandspecific RNAseq analysis of the transcriptome of the typhoid bacillusSalmonella Typhi. PLoS Genet. 2009, 5: e100056910.1371/journal.pgen.1000569.
Han X, Wu X, Chung WY, Li T, Nekrutenko A, Altman NS, Chen G, Ma H: Transcriptome of embryonic and neonatal mouse cortex by highthroughput RNA sequencing. Proc Natl Acad Sci. 1274, 106: 112746.
Parikh A, Miranda ER, KatohKurasawa M, Fuller D, Rot G, Zagar L, Curk T, Sucgang R, Chen R, Zupan B, Loomis WF, Kuspa A, Shaulsky G: Conserved developmental transcriptomes in evolutionarily divergent species. Genome Biol. 2010, 11: R3510.1186/gb2010113r35.
Robinson MD, Smyth GK: Smallsample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008, 9: 321332.
Zhou YH, Xia K, Wright FA: A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics. 2011, 27: 26722678. 10.1093/bioinformatics/btr449.
Robinson MD, Smyth GK: Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007, 23: 28812887. 10.1093/bioinformatics/btm453.
Hardcastle TJ, Kelly KA: baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010, 11: 42210.1186/1471210511422.
Wu H, Wang C, Wu Z: A new shrinkage estimator for dispersion improves differential expression detection in RNAseq data. Biostatistics. 2013, 14: 232243. 10.1093/biostatistics/kxs033.
Lund S, Nettleton D, McCarthy D, Smyth G: Detecting differential expression in RNAsequence data using quasilikelihood with shrunken dispersion estimates. Stat Appl Genet Mol Biol. 2012, 11: Article 8
Srivastava S, Chen L: A twoparameter generalized Poisson model to improve the analysis of RNAseq data. Nucleic Acids Res. 2010, 38: e17010.1093/nar/gkq670.
Auer PL, Doerge RW: A twostage Poisson model for testing RNAseq data. Stat Appl Genet Mol Biol. 2011, 10: Article 26
Li J, Witten D, Johnstone I, Tibshirani R: Normalization, testing, and false discovery rate estimation for RNAsequencing data. Biostatistics. 2012, 13: 523538. 10.1093/biostatistics/kxr031.
Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ, Taylor JM: Efficient experimental design and analysis strategies for the detection of differential expression using RNASequencing. BMC Genomics. 2012, 13: 48410.1186/1471216413484.
Soneson C, Delorenzi M: A comparison of methods for differential expression analysis of RNAseq data. BMC Bioinformatics. 2013, 14: 9110.1186/147121051491.
Ritchie M, Diyagama D, Neilson J, Van Laar R, Dobrovic A, Holloway A, Smyth G: Empirical array quality weights in the analysis of microarray data. BMC Bioinformatics. 2006, 7: 26110.1186/147121057261.
McCullagh P, Nelder JA: Generalized Linear Models. 1989, Boca Raton: Chapman & Hall/CRC
Wedderburn RWM: Quasilikelihood functions, generalized linear models, and the GaussNewton method. Biometrika. 1974, 61: 439447.
Carroll RJ, Ruppert D: A comparison between maximum likelihood and generalized least squares in a heteroscedastic linear model. J Am Stat Assoc. 1982, 77: 878882. 10.1080/01621459.1982.10477901.
Nelder JA, Pregibon D: An extended quasilikelihood function. Biometrika. 1987, 74: 221232. 10.1093/biomet/74.2.221.
McCarthy DJ, Chen Y, Smyth GK: Differential expression analysis of multifactor RNASeq experiments with respect to biological variation. Nucleic Acids Res. 2012, 40: 42884297. 10.1093/nar/gks042.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nature Methods. 2008, 5: 621628. 10.1038/nmeth.1226.
Sartor MA, Tomlinson CR, Wesselkamper SC, Sivaganesan S, Leikauf GD, Medvedovic M: Intensitybased hierarchical Bayes method improves testing for differentially expressed genes in microarray experiments. BMC Bioinformatics. 2006, 7: 53810.1186/147121057538.
Sequencing Quality Control (SEQC) Project. [http://www.fda.gov/MicroArrayQC],
Ambion FirstChoice Human Brain Reference RNA. [http://products.invitrogen.com/ivgn/product/AM6050],
Baker SC, Bauer SR, Beyer RP, Brenton JD, Bromley B, Burrill J, Causton H, Conley MP, Elespuru R, Fero M, Foy C, Fuscoe J, Gao X, Gerhold DL, Gilles P, Goodsaid F, Guo X, Hackett J, Hockett RD, Ikonomi P, Irizarry RA, Kawasaki ES, KaysserKranich T, Kerr K, Kiser G, Koch WH, Lee KY, Liu C, Liu ZL, Lucas A, et al: The external RNA controls consortium: a progress report. Nature Methods. 2005, 2: 731734. 10.1038/nmeth1005731.
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK: Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010, 464: 768772. 10.1038/nature08872.
Esnaola M, Puig P, Gonzalez D, Castelo R, Gonzalez JR: A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNAseq experiments. BMC Bioinformatics. 2013, 14: 25410.1186/1471210514254.
Skaletsky H, KurodaKawaguchi T, Minx PJ, Cordum HS, Hillier L, Brown LG, Repping S, Pyntikova T, Ali J, Bieri T, Chinwalla A, Delehaunty A, Delehaunty K, Du H, Fewell G, Fulton L, Fulton R, Graves T, Hou SF, Latrielle P, Leonard S, Mardis E, Maupin R, McPherson J, Miner T, Nash W, Nguyen C, Ozersky P, Pepin K, Rock S, et al: The malespecific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature. 2003, 423: 825837. 10.1038/nature01722.
Gonzalez JR, Esnaola M: tweeDEseqCountData: RNAseq count data employed in the vignette of the tweeDEseq package. [http://www.bioconductor.org],
Carrel L, Willard HF: Xinactivation profile reveals extensive variability in Xlinked gene expression in females. Nature. 2005, 434: 400404. 10.1038/nature03479.
Graveley BR, Brooks N, Carlson JW, Duff MO, Landolin JM, Yang L, Artieri G, van Baren MJ, Boley N, Booth BW, Brown JB, Cherbas L, Davis CA, Dobin A, Li R, Lin W, Malone JH, Mattiuzzo NR, Miller D, Sturgill D, Tuch BB, Zaleski C, Zhang D, Blanchette M, Dudoit S: The developmental transcriptome ofDrosophila melanogaster. Nature. 2011, 471: 473479. 10.1038/nature09715.
Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, Xue C, Marinov GK, Khatun J, Williams BA, Zaleski C, Rozowsky J, Röder M, Kokocinski F, Abdelhamid RF, Alioto T, Antoshechkin I, Baer MT, Bar NS, Batut P, Bell K, Bell I, Chakrabortty S, Chen X, Chrast J, Curado J, et al: Landscape of transcription in human cells. Nature. 2012, 489: 101108. 10.1038/nature11233.
GonzalezPorta M, Frankish A, Rung J, Harrow J, Brazma A: Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol. 2013, 14: R7010.1186/gb2013147r70.
Bera AK, Bilias Y: Rao’s score, Neyman’s Cαand Silvey’s LM tests: an essay on historical developments and some new results. J Stat Plann Inference. 2001, 97: 944. 10.1016/S03783758(00)003438.
Pregibon D: Score tests in GLIM with applications. GLIM 82 Proceedings of the International Conference on Generalised Linear Models. Edited by: Gilchrist R. 1982, New York: Springer, 8797.
Phipson B, Lee S, Majewski IJ, Alexander WS, Smyth GK: Empirical Bayes in the presence of exceptional cases, with application to microarray data. 2013, [http://www.statsci.org/smyth/pubs/RobustEBayesPreprint.pdf],
Oehlert GW: A note on the delta method. Am Statistician. 1992, 46: 2729.
Cleveland WS: Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc. 1979, 74: 829836. 10.1080/01621459.1979.10481038.
Oshlack A, Emslie D, Corcoran L, Smyth G: Normalization of boutique twocolor microarrays with a high proportion of differentially expressed probes. Genome Biol. 2007, 8: R210.1186/gb200781r2.
Robinson MD, Oshlack A: A scaling normalization method for differential expression analysis of RNAseq data. Genome Biol. 2010, 11: R2510.1186/gb2010113r25.
Gale WA, Sampson G: GoodTuring frequency estimation without tears. J Quant Linguist. 1995, 2: 217237. 10.1080/09296179508590051.
Law CW, Chen Y, Shi W, Smyth GK: Supplementary information for ‘Voom: precision weights unlock linear model analysis tools for RNAseq read counts’. [http://bioinf.wehi.edu.au/voom],
Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Scherf U, ThierryMieg J, Wang C, Wilson M, Wolber PK, et al: The MicroArray Quality Control (MAQC) project shows inter and intraplatform reproducibility of gene expression measurements. Nature Biotechnol. 2006, 24: 11511161. 10.1038/nbt1239.
Liao Y, Smyth GK, Shi W: The Subread aligner: fast, accurate and scalable read mapping by seedandvote. Nucleic Acids Res. 2013, 41: e10810.1093/nar/gkt214.
Liao Y, Smyth GK, Shi W: featureCounts: an efficient generalpurpose read summarization program. Bioinformatics. 2013, [http://bioinformatics.oxfordjournals.org/content/early/2013/11/30/bioinformatics.btt656],
Shi W, Liao Y: Rsubread: a super fast, sensitive and accurate read aligner for mapping nextgeneration sequencing reads. [http://www.bioconductor.org],
Bolstad BM, Irizarry RA, Åstrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19: 185193. 10.1093/bioinformatics/19.2.185.
Frazee AC, Langmead B, Leek JT: ReCount: a multiexperiment resource of analysisready RNAseq gene count datasets. BMC Bioinformatics. 2011, 12: 44910.1186/1471210512449.
Frazee A, Langmead B, Leek J: ReCount: a multiexperiment resource of analysisready RNAseq gene count datasets. [http://bowtiebio.sourceforge.net/recount],
Falcon S, Gentleman R: Using GOstats to test gene lists for GO term association. Bioinformatics. 2007, 23: 257258. 10.1093/bioinformatics/btl567.
Carlson M: org.Dm.eg.db: Genome wide annotation for Fly. [http://www.bioconductor.org],
Bottomly D, Walter NA, Hunter JE, Darakjian P, Kawane S, Buck KJ, Searles RP, Mooney M, McWeeney S K Hitzemann: Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNASeq and microarrays. PLoS One. 2011, 6: e1782010.1371/journal.pone.0017820.
Gentleman R, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth GK, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R8010.1186/gb2004510r80.
Bioconductor: open source software for bioinformatics. [http://www.bioconductor.org],
Comprehensive R Archive Network. [http://www.rproject.org],
Auer P, Doerge RW: TSPM.R: R code for a twostage Poisson model for testing RNAseq data. 2011, [http://www.stat.purdue.edu/~doerge/software/TSPM.R],
Smyth GK: limma: Linear Models for Microarray Data. [http://www.bioconductor.org/packages/release/bioc/html/limma.html],
Acknowledgements
The authors are grateful to Charles Wang and Leming Shi for the preliminary SEQC data and to Stephen Nutt for the RNAseq mouse data used to motivate aspects of the simulation study.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they do not have any competing interests.
Authors’ contributions
CL and GS developed the method and wrote the manuscript. GS implemented the method and CL performed the analyses. YC contributed to the comparative simulation studies and carried out the gene ontology analysis. WS performed read alignment and summarized the SEQC and mouse RNAseq datasets. All authors read and approved the final manuscript.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Cite this article
Law, C.W., Chen, Y., Shi, W. et al. voom: precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol 15, R29 (2014). https://doi.org/10.1186/gb2014152r29
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/gb2014152r29
Keywords
 Differentially Express
 Negative Binomial
 Library Size
 Precision Weight
 Error Rate Control