A scaling normalization method for differential expression analysis of RNAseq data
 Mark D Robinson^{1, 2}Email author and
 Alicia Oshlack^{1}Email author
DOI: 10.1186/gb2010113r25
© Robinson and Oshlack; licensee BioMed Central Ltd. 2010
Received: 19 November 2009
Accepted: 2 March 2010
Published: 2 March 2010
Abstract
The fine detail provided by sequencingbased transcriptome surveys suggests that RNAseq is likely to become the platform of choice for interrogating steady state RNA. In order to discover biologically important changes in expression, we show that normalization continues to be an essential step in the analysis. We outline a simple and effective method for performing normalization and show dramatically improved results for inferring differential expression in simulated and publicly available data sets.
Background
The transcriptional architecture is a complex and dynamic aspect of a cell's function. Next generation sequencing of steady state RNA (RNAseq) gives unprecedented detail about the RNA landscape within a cell. Not only can expression levels of genes be interrogated without specific prior knowledge, but comparisons of expression levels between genes within a sample can be made. It has also been demonstrated that splicing variants [1, 2] and single nucleotide polymorphisms [3] can be detected through sequencing the transcriptome, opening up the opportunity to interrogate allelespecific expression and RNA editing.
An important aspect of dealing with the vast amounts of data generated from short read sequencing is the processing methods used to extract and interpret the information. Experience with microarray data has repeatedly shown that normalization is a critical component of the processing pipeline, allowing accurate estimation and detection of differential expression (DE) [4]. The aim of normalization is to remove systematic technical effects that occur in the data to ensure that technical bias has minimal impact on the results. However, the procedure for generating RNAseq data is fundamentally different from that for microarray data, so the normalization methods used are not directly applicable. It has been suggested that 'One particularly powerful advantage of RNAseq is that it can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets' [5]. We demonstrate here that the reality of RNAseq data analysis is not this simple; normalization is often still an important consideration.
Current RNAseq analysis methods typically standardize data between samples by scaling the number of reads in a given lane or library to a common value across all sequenced libraries in the experiment. For example, several authors have modeled the observed counts for a gene with a mean that includes a factor for the total number of reads [6–8]. These approaches can differ in the distributional assumptions made for inferring differences, but the consensus is to use the total number of reads in the model. Similarly, for LONGSAGEseq data, 't Hoen et al. [9] use the square root of scaled counts or the betabinomial model of Vencio et al. [10], both of which use the total number of observed tags. For normalization, Mortazavi et al. [11] adjust their counts to reads per kilobase per million mapped (RPKM), suggesting it 'facilitates transparent comparison of transcript levels both within and between samples.' By contrast, Cloonan et al. [12] logtransform the gene lengthnormalized count data and apply standard microarray analysis techniques (quantile normalization and moderated tstatistics). Sultan et al. [2] normalize read counts by the 'virtual length' of the gene, the number of unique 27mers in exonic sequence, as well as by the total number of reads. Recently, Balwierz et al. [13] illustrated that deepCAGE (deep sequencing cap analysis of gene expression) data follow an approximate power law distribution and proposed a normalization strategy that equates the read count distributions across samples.
Scaling to library size as a form of normalization makes intuitive sense, given it is expected that sequencing a sample to half the depth will give, on average, half the number of reads mapping to each gene. We believe this is appropriate for normalizing between replicate samples of an RNA population. However, library size scaling is too simple for many biological applications. The number of tags expected to map to a gene is not only dependent on the expression level and length of the gene, but also the composition of the RNA population that is being sampled. Thus, if a large number of genes are unique to, or highly expressed in, one experimental condition, the sequencing 'real estate' available for the remaining genes in that sample is decreased. If not adjusted for, this sampling artifact can force the DE analysis to be skewed towards one experimental condition. Current analysis methods [6, 11] have not accounted for this proportionality property of the data explicitly, potentially giving rise to higher false positive rates and lower power to detect true differences.
The fundamental issue here is the appropriate metric of expression to compare across samples. The standard procedure is to compute the proportion of each gene's reads relative to the total number of reads and compare that across all samples, either by transforming the original data or by introducing a constant into a statistical model. However, since different experimental conditions (for example, tissues) express diverse RNA repertoires, we cannot always expect the proportions to be directly comparable. Furthermore, we argue that in the discovery of biologically meaningful changes in expression, it should be considered undesirable to have under or oversampling effects (discussed further below) guiding the DE calls. The normalization method presented below uses the raw data to estimate appropriate scaling factors that can be used in downstream statistical analysis procedures, thus accounting for the sampling properties of RNAseq data.
Results and discussion
A hypothetical scenario
Estimated normalization factors should ensure that a gene with the same expression level in two samples is not detected as DE. To further highlight the need for more sophisticated normalization procedures in RNAseq data, consider a simple thought experiment. Imagine we have a sequencing experiment comparing two RNA populations, A and B. In this hypothetical scenario, suppose every gene that is expressed in B is expressed in A with the same number of transcripts. However, assume that sample A also contains a set of genes equal in number and expression that are not expressed in B. Thus, sample A has twice as many total expressed genes as sample B, that is, its RNA production is twice the size of sample B. Suppose that each sample is then sequenced to the same depth. Without any additional adjustment, a gene expressed in both samples will have, on average, half the number of reads from sample A, since the reads are spread over twice as many genes. Therefore, the correct normalization would adjust sample A by a factor of 2.
The hypothetical example above highlights the notion that the proportion of reads attributed to a given gene in a library depends on the expression properties of the whole sample rather than just the expression level of that gene. Obviously, the above example is artificial. However, there are biological and even technical situations where such a normalization is required. For example, if an RNA sample is contaminated, the reads that represent the contamination will take away reads from the true sample, thus dropping the number of reads of interest and offsetting the proportion for every gene. However, as we demonstrate, true biological differences in RNA composition between samples will be the main reason for normalization.
Sampling framework
S_{ k }represents the total RNA output of a sample. The problem underlying the analysis of RNAseq data is that while N_{ k }is known, S_{ k }is unknown and can vary drastically from sample to sample, depending on the RNA composition. As mentioned above, if a population has a larger total RNA output, then RNAseq experiments will undersample many genes, relative to another sample.
At this stage, we leave the variance in the above model for Y_{ gk }unspecified. Depending on the experimental situation, Poisson seems appropriate for technical replicates [6, 7] and Negative Binomial may be appropriate for the additional variation observed from biological replicates [14]. It is also worth noting that, in practice, the L_{ g }is generally absorbed into the μ_{ gk }parameter and does not get used in the inference procedure. However, it has been well established that gene length biases are prominent in the analysis of gene expression [15].
The trimmed mean of Mvalues normalization method
To robustly summarize the observed M values, we trim both the M values and the A values before taking the weighted average. Precision (inverse of the variance) weights are used to account for the fact that log fold changes (effectively, a log relative risk) from genes with larger read counts have lower variance on the logarithm scale. See Materials and methods for further details.
For a twosample comparison, only one relative scaling factor (f_{ k }) is required. It can be used to adjust both library sizes (divide the reference by and multiply nonreference by ) in the statistical analysis (for example, Fisher's exact test; see Materials and methods for more details).
Normalization factors across several samples can be calculated by selecting one sample as a reference and calculating the TMM factor for each nonreference sample. Similar to twosample comparisons, the TMM normalization factors can be built into the statistical model used to test for DE. For example, a Poisson model would modify the observed library size to an effective library size, which adjusts the modeled mean (for example, using an additional offset in a generalized linear model; see Materials and methods for further details).
A liver versus kidney data set
Number of genes called differentially expressed between liver and kidney at a false discovery rate <0.001 using different normalization methods
Library size normalization  TMM normalization  Overlap  

Higher in liver  2,355  4,293  2,355 
Higher in kidney  8,332  4,935  4,935 
Total  10,867  9,228  7,290 
House keeping genes (545)  
Higher in liver  45  137  45 
Higher in kidney  376  220  220 
Total  421  357  265 
Other datasets
The global shift in logfoldchange caused by RNA composition differences occurs at varying degrees in other RNAseq datasets. For example, an M versus A plot for the Cloonan et al. [12] dataset (Figure S3 in Additional file 1) gives an estimated TMM scaling factor of 1.04 between the two samples (embryoid bodies versus embryonic stem cells), sequenced on the SOLiD™ system. The M versus A plot for this dataset also highlights an interesting set of genes that have lower overall expression, but higher in embryoid bodies. This explains the positive shift in logfoldchanges for the remaining genes. The TMM scale factor appears close to the median logfoldchanges amongst a set of approximately 500 mouse housekeeping genes (from [17]). As another example, the Li et al. [18] dataset, using the llumina 1G Genome Analyzer, exhibits a shift in the overall distribution of logfoldchanges and gives a TMM scaling factor of 0.904 (Figure S4 in Additional file 1). However, there are sequencingbased datasets that have quite similar RNA outputs and may not need a significant adjustment. For example, the smallRNAseq data from Kuchenbauer et al. [19] exhibits only a modest bias in the logfoldchanges (Figure S5 in Additional file 1).
Spikein controls have the potential to be used for normalization. In this scenario, small but known amounts of RNA from a foreign organism are added to each sample at a specified concentration. In order to use spikein controls for normalization, the ratio of the concentration of the spike to the sample must be kept constant throughout the experiment. In practice, this is difficult to achieve and small variations will lead to biased estimation of the normalization factor. For example, using the spikedin DNA from the Mortazavi et al. data set [11] would lead to unrealistic normalization factor estimates (Figure S6 in Additional file 1). As with microarrays, it is generally more robust to carefully estimate normalization factors using the experimental data (for example, [20]).
Simulation studies
To further compare the performance of the TMM normalization with previously used methods in the context of the DE analysis of RNAseq data, we extend the above simulation to include replicate sequencing runs. Specifically, we compare three published methods: lengthnormalized count data that have been log transformed and quantile normalized, as implemented by Cloonan et al. [12], a Poisson regression [6] with library size and TMM normalization and a Poisson exact test [8] with library size and TMM normalization. We do not compare directly with the normalization proposed in Balwierz et al. [13] since the liver and kidney dataset do not appear to follow a power law distribution and have quite distinct count distributions (Figure S8 in Additional file 1). Furthermore, in light of the RNA composition bias we observe, it is not clear whether equating the count distributions across samples is the most logical procedure. In addition, we do not directly compare the normalization to virtual length [2] or RPKM [11] normalization, since a statistical analysis of the transformed data was not mentioned. However, we illustrate with M versus A plots that their normalization does not completely remove RNA composition bias (Figures S9 and S10 in Additional file 1).
Conclusions
TMM normalization is a simple and effective method for estimating relative RNA production levels from RNAseq data. The TMM method estimates scale factors between samples that can be incorporated into currently used statistical methods for DE analysis. We have shown that normalization is required in situations where the underlying distribution of expressed transcripts between samples is markedly different. The assumptions behind the TMM method are similar to the assumptions commonly made in microarray normalization procedures such as lowess normalization [21] and quantile normalization [22]. Therefore, adequately normalized array data do not show the effects of different total RNA output between samples. In essence, both microarray and TMM normalization assume that the majority of genes, common to both samples, are not differentially expressed. Our simulation studies indicate that the TMM method is robust against deviations to this assumption up to about 30% of DE in one direction. For many applications, this assumption will not be violated.
One notable difference with TMM normalization for RNAseq is that the data themselves do not need to be modified, unlike microarray normalization and some implemented RNAseq strategies [11, 12]. Here, the estimated normalization factors are used directly in the statistical model used to test for DE, while preserving the sampling properties of the data. Because the data themselves are not modified, it can be used in further applications such as comparing expression between genes.
Normalization will be crucial in many other applications of high throughput sequencing where the DNA or RNA populations being compared differ in their composition. For example, chromatin immunoprecipitation (ChIP) followed by next generation sequencing (ChIPseq) may require a similar adjustment to compare between samples containing different repertoires of bound targets. Interestingly, the PeakSeq method [23] uses a linear regression on binned counts across the genome to estimate a scaling factor between two ChIP populations to account for the different coverages. This is similar in principle to what is proposed here, but possibly less robust. We demonstrated that there are numerous biological situations where a composition adjustment will be required. In addition, technical artifacts that are not fully captured by the library size adjustment can be accounted for with the empirical adjustment. Furthermore, it is not clear that DNA spikedin at known concentrations will allow robust estimation of normalization factors.
Similar to previous high throughput technologies such as microarrays, normalization is an essential step for inferring true differences in expression between samples. The number of reads for a gene is dependent not only on the gene's expression level and length, but also on the population of RNA from which it originates. We present a straightforward and effective empirical method for normalization of RNAseq data.
Materials and methods
TMM normalization details
The cases where Y_{ gk }= 0 or Y_{ gr }= 0 are trimmed in advance of this calculation since logfoldchanges cannot be calculated; G* represents the set of genes with valid M_{ g }and A_{ g }values and not trimmed, using the percentages above. It should be clear that .
As Figure 2a indicates, the variances of the M values at higher total count are lower. Within a library, the vector of counts is multinomial distributed and any individual gene is binomial distributed with a given library size and proportion. Using the delta method, one can calculate an approximate variance for the M_{ g }, as is commonly done with log relative risk, and the inverse of these is used to weight the average.
We compared the weighted with the unweighted trimmed mean as well as an alternative robust estimator (robust linear model) over a range of simulation parameters, as shown in Figure S4 in Additional file 1.
Housekeeping genes
Human housekeeping genes, as described in [16], were downloaded from [25] and matched to the Ensembl gene identifiers using the Bioconductor [26] biomaRt package [27]. Similarly, mouse housekeeping genes were taken to be the approximately 500 genes with lowest coefficient of variation, as calculated by de Jonge et al. [17].
Statistical testing
For a twolibrary comparison, we use the sage.test function from the CRAN statmod package [28] to calculate a Fisher exact Pvalue for each gene. To apply TMM normalization, we replace the original library sizes with 'effective' library sizes. For two libraries, the effective library sizes are calculated by multiplying/dividing the square root of the estimated normalization factor with the original library size.
where represents the fraction of total reads for gene g in experimental condition z_{ k }. Their analysis utilizes an offset to account for the library size and a likelihood ratio (LR) statistic to test for differences in expression between libraries (that is, H_{0}:μ_{g1 }= μ_{g2}). In order to use TMM normalization, we augment the original offset with the estimated normalization factor. The same LR testing framework is then used to calculate Pvalues for DE between tissues. We modified this analysis to use an exact Poisson test for testing the difference between two replicated groups. The strategy is similar in principle to the Fisher's exact test: conditioning on the total count, we calculated the probability of observing group counts as or more extreme than what we actually observed. The total and group total counts are all Poisson distributed.
We reimplemented the method from Cloonan et al. [12] for the analysis of simulated data using a custom R [29] script.
Simulation details
The simulation is set up to sample a dataset from a given empirical distribution of read counts (that is, from a distribution of observed Y_{ g }). The mean is calculated from the sampled read counts divided by the sum S_{ k }and multiplied by a specified library size N_{ k }(according to the model). The simulated data are then randomly sampled from a Poisson distribution, given the mean. We have parameters specifying the number of genes common to both libraries and the number of genes unique to each sample. Additional parameters specify the amount, direction and magnitude of DE as well as the depth of sequencing (that is, range of total numbers of reads). Since we have inserted known differentially expressed genes, we can rank genes according to various statistics and plot the number of false discoveries as a function of the ranking. Table S1 in Additional file 1 gives the parameter settings used for the simulations presented in Figures 2 and 3.
Software
Software implementing our method was released within the edgeR package [30] in version 2.5 of Bioconductor [26] and is available from [31]. Scripts and data for our analyses, including the simulation framework, have been made available from [32].
Abbreviations
 ChIP:

chromatin immunoprecipation
 DE:

differential expression
 LR:

likelihood ratio
 RPKM:

reads per kilobase per million mapped
 TMM:

trimmed mean of M values.
Declarations
Acknowledgements
We wish to thank Terry Speed, Gordon Smyth and Matthew Wakefield for helpful discussion and critical reading of the manuscript. This work is partly supported by the National Health and Medical Research Council (481347MDR, 490037AO)
Authors’ Affiliations
References
 Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456: 470476. 10.1038/nature07509.PubMedPubMed CentralView ArticleGoogle Scholar
 Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML: A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008, 321: 956960. 10.1126/science.1160342.PubMedView ArticleGoogle Scholar
 Wang X, Sun Q, McGrath SD, Mardis ER, Soloway PD, Clark AG: Transcriptomewide identification of novel imprinted genes in neonatal mouse brain. PLoS One. 2008, 3: e383910.1371/journal.pone.0003839.PubMedPubMed CentralView ArticleGoogle Scholar
 Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003, 19: 185193. 10.1093/bioinformatics/19.2.185.PubMedView ArticleGoogle Scholar
 Wang Z, Gerstein M, Snyder M: RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 5763. 10.1038/nrg2484.PubMedPubMed CentralView ArticleGoogle Scholar
 Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNAseq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008, 18: 15091517. 10.1101/gr.079558.108.PubMedPubMed CentralView ArticleGoogle Scholar
 Bullard JH, Purdom EA, Hansen KD, Durinck S, Dudoit S: Statistical inference in mRNASeq: exploratory data analysis and differential expression. UC Berkeley Division of Biostatistics Working Paper Series. 2009, paper 247Google Scholar
 Robinson MD, Smyth GK: Smallsample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics. 2008, 9: 321332. 10.1093/biostatistics/kxm030.PubMedView ArticleGoogle Scholar
 t Hoen PA, Ariyurek Y, Thygesen HH, Vreugdenhil E, Vossen RH, de Menezes RX, Boer JM, van Ommen GJ, den Dunnen JT: Deep sequencingbased expression analysis shows major advances in robustness, resolution and interlab portability over five microarray platforms. Nucleic Acids Res. 2008, 36: e14110.1093/nar/gkn705.View ArticleGoogle Scholar
 Vencio RZ, Brentani H, Patrao DF, Pereira CA: Bayesian model accounting for withinclass biological variability in serial analysis of gene expression (SAGE). BMC Bioinformatics. 2004, 5: 11910.1186/147121055119.PubMedPubMed CentralView ArticleGoogle Scholar
 Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNASeq. Nat Methods. 2008, 5: 621628. 10.1038/nmeth.1226.PubMedView ArticleGoogle Scholar
 Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM: Stem cell transcriptome profiling via massivescale mRNA sequencing. Nat Methods. 2008, 5: 613619. 10.1038/nmeth.1223.PubMedView ArticleGoogle Scholar
 Balwierz PJ, Carninci P, Daub CO, Kawai J, Hayashizaki Y, Van Belle W, Beisel C, van Nimwegen E: Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data. Genome Biol. 2009, 10: R7910.1186/gb2009107r79.PubMedPubMed CentralView ArticleGoogle Scholar
 Robinson MD, Smyth GK: Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007, 23: 28812887. 10.1093/bioinformatics/btm453.PubMedView ArticleGoogle Scholar
 Oshlack A, Wakefield MJ: Transcript length bias in RNAseq data confounds systems biology. Biol Direct. 2009, 4: 1410.1186/17456150414.PubMedPubMed CentralView ArticleGoogle Scholar
 Eisenberg E, Levanon EY: Human housekeeping genes are compact. Trends Genet. 2003, 19: 362365. 10.1016/S01689525(03)001409.PubMedView ArticleGoogle Scholar
 de Jonge HJ, Fehrmann RS, de Bont ES, Hofstra RM, Gerbens F, Kamps WA, de Vries EG, Zee van der AG, te Meerman GJ, ter Elst A: Evidence based selection of housekeeping genes. PLoS One. 2007, 2: e89810.1371/journal.pone.0000898.PubMedPubMed CentralView ArticleGoogle Scholar
 Li H, Lovci MT, Kwon YS, Rosenfeld MG, Fu XD, Yeo GW: Determination of tag density required for digital transcriptome analysis: application to an androgensensitive prostate cancer model. Proc Natl Acad Sci USA. 2008, 105: 2017920184. 10.1073/pnas.0807121105.PubMedPubMed CentralView ArticleGoogle Scholar
 Kuchenbauer F, Morin RD, Argiropoulos B, Petriv OI, Griffith M, Heuser M, Yung E, Piper J, Delaney A, Prabhu AL, Zhao Y, McDonald H, Zeng T, Hirst M, Hansen CL, Marra MA, Humphries RK: Indepth characterization of the microRNA transcriptome in a leukemia progression model. Genome Res. 2008, 18: 17871797. 10.1101/gr.077578.108.PubMedPubMed CentralView ArticleGoogle Scholar
 Oshlack A, Emslie D, Corcoran LM, Smyth GK: Normalization of boutique twocolor microarrays with a high proportion of differentially expressed probes. Genome Biol. 2007, 8: R210.1186/gb200781r2.PubMedPubMed CentralView ArticleGoogle Scholar
 Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002, 30: e1510.1093/nar/30.4.e15.PubMedPubMed CentralView ArticleGoogle Scholar
 Irizarry RA, Hobbs B, Collin F, BeazerBarclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249264. 10.1093/biostatistics/4.2.249.PubMedView ArticleGoogle Scholar
 Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB: PeakSeq enables systematic scoring of ChIPseq experiments relative to controls. Nat Biotechnol. 2009, 27: 6675. 10.1038/nbt.1518.PubMedPubMed CentralView ArticleGoogle Scholar
 Casella G, Berger RL: Statistical Inference. 2002, Pacific Grove, CA: Duxbury PressGoogle Scholar
 Housekeeping Genes. [http://www.cgen.com/supp_info/Housekeeping_genes.html]
 Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R8010.1186/gb2004510r80.PubMedPubMed CentralView ArticleGoogle Scholar
 Durinck SMY, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W: BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005, 21: 34393440. 10.1093/bioinformatics/bti525.PubMedView ArticleGoogle Scholar
 CRAN  Package statmod. [http://cran.rproject.org/web/packages/statmod/index.html]
 Team RDC: R: A Language and Environment for Statistical Computing. 2009Google Scholar
 Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139140. 10.1093/bioinformatics/btp616.PubMedPubMed CentralView ArticleGoogle Scholar
 Bioconductor. [http://www.bioconductor.org/]
 WEHI Bioinformatics  Resources. [http://bioinf.wehi.edu.au/resources/]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.