Improving RNA-Seq expression estimates by correcting for fragment bias
© Roberts et al.; licensee BioMed Central Ltd. 2011
Received: 4 December 2010
Accepted: 16 March 2011
Published: 16 March 2011
The biochemistry of RNA-Seq library preparation results in cDNA fragments that are not uniformly distributed within the transcripts they represent. This non-uniformity must be accounted for when estimating expression levels, and we show how to perform the needed corrections using a likelihood based approach. We find improvements in expression estimates as measured by correlation with independently performed qRT-PCR and show that correction of bias leads to improved replicability of results across libraries and sequencing technologies.
The randomness inherent in many of the preparation steps for RNA-Seq leads to fragments whose starting points (relative to the transcripts from which they were sequenced) appear to be chosen approximately uniformly at random. This observation has been the basis of assumptions underlying a number of RNA-Seq analysis approaches that, in computer science terms, invert the 'reduction' of transcriptome estimation to DNA sequencing [2–6]. However, recent careful analysis has revealed both positional  and sequence-specific [8, 9] biases in sequenced fragments. Positional bias refers to a local effect in which fragments are preferentially located towards either the beginning or end of transcripts. Sequence-specific bias is a global effect where the sequence surrounding the beginning or end of potential fragments affects their likelihood of being selected for sequencing. These biases can affect expression estimates , and it is therefore important to correct for them during RNA-Seq analysis.
Our main result is the development of a likelihood based approach for simultaneous estimation of bias parameters and expression levels using the likelihood framework of . This complements work of [8, 10] where corrections are developed based on another likelihood model, and also extends their work by incorporating simultaneous estimation and correction of positional bias. We demonstrate that our method improves expression estimates in comparison with independently obtained qRT-PCR on a benchmark dataset. Using the same data, we also show that our method improves on the approaches of [8, 10]. RNA-Seq technology is changing rapidly, and this is evident in the development of numerous preparation protocols (for a recent review see ) and increasingly longer read lengths from sequencing machines . When assessing the impact of bias correction, we have therefore included both early RNA-Seq data of the type that many laboratories might be producing with older machines, as well as newer data that reflects recent protocol choices and demonstrates the improvements in sequencing technologies. This has required us to make our methods robust to both single- and paired-end reads, strand specific and non-specific protocols, and a variety of priming and fragmentation methods. One of our main findings is that bias correction improves the correlation of expression estimates obtained from sequence data generated using different sample preparations and different sequencing technologies.
Results and Discussion
Estimating fragment bias in existing protocols
Fragment counts in an RNA-Seq experiment are determined by two different phenomena: fragments originating from highly expressed transcripts will appear more often in the data than those originating from lower-expressed transcripts, and library preparations include biases that may preferentially select some potential fragments over others. By fragment bias we mean only the over- or under-representation of fragments due to sequence-specific or positional bias as discussed in the Background. Because expression levels also affect fragment abundances, it is necessary to jointly estimate transcript abundances and bias parameters in order to properly learn the bias directly from RNA-Seq data.
Validation by comparison to alternative expression assays
We emphasize that our goal was not to validate RNA-Seq per se, but rather to show that bias correction improves expression estimation. Therefore, in interpreting the correlations throughout the paper, we focused on improvements in correlation with bias correction and not on the absolute value. In this regard, we report most of our results as fraction discrepancy explained, which we calculated by dividing the change in R2 after bias correction by the difference of the initial R2 from 1 (a perfect correlation). Selected correlation plots can be found in Figure S3 of Additional file 1 and all raw expression data in Additional file 2. Furthermore, we mention that we observed that correlation results were sensitive to the extent of filtering of low abundance fragments and we therefore attempted to eliminate filtering in the experiments we performed (see Materials and methods for more detail).
A major problem with validating RNA-Seq expression estimates is that there is no clear 'gold standard' for expression estimation. Comparison of RNA-Seq to microarrays has suggested that the former technology is more accurate than the latter . We examined the recently published NanoString nCounter gene expression system , but noticed many unexplainable outliers and high variance between technical replicates (see Figure S4 of Additional file 1 and data in Additional file 2). Quantitative reverse transcription PCR (qRT-PCR) has served as a benchmark in numerous studies but it is not a perfect expression measurement assay , and it is therefore a priori unclear which technology currently produces the most accurate expression estimates. Nevertheless, at present we believe it to be the best measure of expression aside from, perhaps, RNA-Seq itself. Due to the previously demonstrated superiority of RNA-Seq over microarrays, and the problems with NanoString, we performed all our benchmarking with respect to qRT-PCR.
We examined the basis for change in correlation by further investigating, for each transcript, whether its expression estimate increased or decreased after bias correction, and by how much. The arrows in Figure 4 show the direction and extent of expression change with correction, and the overall fold-change distribution. Many fragments show large changes in expression with a median absolute fold change of 1.5 (Figure 4b). To establish the significance of the improvement in correlation, we performed a permutation test where we changed the expression estimates of transcripts randomly according to the fold change distribution in Figure 4b. We obtained a P-value of 0.0007, meaning that the improvement in R2 our correction accomplishes is highly significant. Together, these results show that bias correction may dramatically affect expression estimates via both increases and decreases of expression values, and that these changes provide an overall improvement in abundance estimates.
Comparison with previous methods
We also compared our approach to the mseq method in . We again used the MAQC HBR qRT-PCR data and this time prepared the sequences and learned parameters for models following the suggested guidelines in , that is we trained the parameters of a MART model for bias by learning from the 100 most expressed transcripts in the experiment, and then tested on the set of 907 transcripts with uniquely mapping TaqMan probes. In this case, we observed an uncorrected R2 = 0.730 and corrected R2 = 0.755. Note that the even though the expression was again calculated using counts, the initial correlation of mseq is better than that of Genominator due to the fact that the implementation in  required us to remap the reads directly to the transcript sequences, which is presumably more accurate than relying on spliced mapping.
We suspect that the overall inferior results of both the Genominator and mseq in comparison to Cufflinks are due in part to the fact that the bias parameters cannot be learned from raw read counts, but must be normalized by the expression values of the transcripts from which the reads originate (Figure 2). For example, in , bias parameters are learned from what are estimated to be the most highly expressed transcripts based on RPKM, but these are likely to also be the most positively biased transcripts, and are therefore not representative in terms of their sequence content. We also believe that, as we argued in , it is important to account for fragment lengths in estimating expression, and read count based expression measures do not use such information. Another issue affecting Genominator is that instead of computing the expected read count as is done in Cufflinks and mseq, the observed read counts are adjusted. This means that in positions lacking read alignments, there is no correction of bias. We believe this may partially explain the improved performance of mseq in comparison to Genominator.
A recurring worry with RNA-Seq has been that repeated experiments, possibly based on different libraries or performed in different laboratories, may be variable due to experimental 'noise'. We investigated these effects starting with an exploration of the correlation between technical replicates before and after bias correction. We define technical replicates to be the sequencing of two different libraries that have been prepared using the same protocol from a single sample. This differs slightly from some previous uses; in particular, technical replication has also referred to two sequencing experiments from the same library. Such replicates have already been shown to exhibit very little variability [18, 19].
Figure 6 shows how correlations of the replicates with qRT-PCR and each other were affected by bias correction. Although the method does improve the pairwise correlations between different library preparations within SRA010153, the initial correlation is already so high (average R2 > 0.96) that we only show the average pairwise correlations against qRT-PCR and the SRA008403 dataset. The greater correlation among the SRA010153 replicates as compared to the correlation between them and SRA008403 further indicates that bias is more similar when the protocol is carried out by the same lab, presumably by the same person. Bias correction clearly recovers much of the differences in quantification between the replicates introduced by sequence and positional bias. Furthermore, as in the initial validation example, the correction brings both sets closer in line with the qRT-PCR standard.
Library preparation methods
Because the NSR dataset was sequenced from the MAQC HBR sample, we were also able to compare it to the qRT-PCR standard. We found that our method explained 33.5% of the discrepancy between an initial estimation and qRT-PCR.
Previous studies on bias in RNA-Seq have focused on experiments performed with Illumina sequencers. To investigate whether bias persists with other prep and sequencing technologies, we examined bias in a SOLiD experiment that sequenced both MAQC samples using the standard whole transcriptome (WT) protocol. We saw clear signs of both sequence-specific and positional bias that differed from the other protocols we had examined (see Figure S2 of Additional le 1).
Bias correction improves expression estimates
Our results confirm that bias correction improves expression estimates and should be used to correct bias introduced in library preparations and by sequencing technologies. We note that there is great variability in the extent of bias among protocols, and bias correction can dramatically affect expression estimates even in protocols of choice (for example the dUTP protocol currently favored by the Broad Institute ).
Implications for differential expression
It is particularly important to consider bias correction in the context of differential expression analysis. This can refer to the comparison of expression levels among transcripts in a single experiment (for example alternative isoforms of a gene), to the agglomeration of data produced by different laboratories, or to the comparison of expression among biological replicates.
We have shown that bias varies between library preps, even when the same protocol is used. However, our results indicate that this variance is much greater when either different protocols or technologies are used. Therefore, while bias correction can be expected to show small improvements in the former case, it is crucial in modern experiments that seek to combine and compare output from multiple library preps using the same or different protocols. For example, in the Drosophila modENCODE transcriptome experiment described in , both SOLiD and Illumina libraries were used at multiple time points during development. To estimate the improvement that could have been gained in the modENCODE experiment by using our correction, we ran Cuffdiff (the differential expression analysis tool packaged with Cufflinks) on the same samples used above to compare bias in the Illumina and SOLiD technologies. We found a 46% decrease in the number of differentially expressed transcripts output by Cuffdiff when bias correction was enabled.
Choice of model
We have developed a bias correction procedure based on a fragment model for RNA-Seq , in contrast with the site model of . We note that our choice is based partly on the observation in  that even after bias correction, variability in the counts of reads at individual sites differ considerably from the variance estimate obtained from the binomial model. Thus, it may be that the model of  is not robust in multiple isoform genes where few sites may distinguish isoforms. It is likely, however, that as RNA-Seq protocols improve and are better understood, site models will be preferable due to their improved resolution.
GC content and bias
Previous RNA-Seq investigations have revealed correlations between expression levels and GC content, and corrections have been proposed to 'normalize' the data with respect to this effect . When examining the sequence-specific bias profiles (see Figure S2 of Additional file 1) we noticed GC effects in the estimated parameters and so we investigated the relationship between sequence-specific bias correction and GC content.
To make the comparison, we defined the bias of a transcript to be the log fold change in effective length, which is a direct measure of the extent of correction of expression estimate in single isoform genes when incorporating bias correction.
We concluded that although normalization of expression values by GC content may be a simple way to remove some bias, it may well be a proxy for other effects rather than of inherent significance.
RNA-Seq data processing pipelines require multiple steps that include read mapping, transcript assembly, expression estimation and differential expression analysis. A difficulty with analysis is that many of these steps are closely related, and improvements in one area can be leveraged in another only if properly integrated. We have shown that in the case of bias correction, estimation of parameters together with abundances can improve expression estimates, and these can in turn affect differential expression analyses, mapping probabilities, and even assemblies.
In order to maximize the benefits of bias correction throughout the RNA-Seq analysis pipeline, we have incorporated it into the Cufflinks RNA-Seq analysis suite , and have pre-configured the software for specific protocols so that users can reap the benefits of bias correction for both stranded and unstranded protocols, as well as single- and paired-end reads. The software is freely available  and is distributed open source under the Boost Software License, version 1.0.
Materials and methods
Parameter estimation and inference
Due to the added sensitivity in our model to the location of fragment ends, we now rely on an empirical fragment length distribution whenever possible, as opposed to the Gaussian approximation in . The fragment length distribution is estimated in one of several ways, depending on what information is provided. If an annotation and paired-end read mappings are given, fragment mappings to single isoform genes are used to determine an empirical distribution. If no annotation is provided, but paired-end read mappings are provided, sufficiently large (≥ 1,000 bp) ranges are found where no fragments have spliced mappings. The mappings in these ranges are used to determine an empirical distribution. If no paired-end fragments are available or not enough are found in these ranges, we use a truncated Gaussian where all lengths less than the minimum read length in the dataset are set to zero probability and the remaining distribution is renormalized. The mean and standard deviation are set according to the distribution specified by the SRA entry, or to 200 and 80, respectively, if the information is unavailable.
The likelihood in our model is a function of the relative transcript abundances (ρ), consisting of the abundances for individual transcripts ρ t such that (here T denotes the set of all transcripts). In order to simplify computations, we estimate the relative abundances for overlapping sets of transcripts instead of directly estimating the parameters ρ t . We define a locus to be a genomic region containing a set of overlapping transcripts (typically isoforms of a gene) and then write the transcript abundance as ρ t = β g γt where β g is the relative abundance of the locus g in which t is contained, and is multiplied by a factor γ t that determines the proportion of each transcript within the locus. We denote the set of all loci by G (for more details see the Supplementary methods of ). Our updated likelihood model, whose full derivation is given in the Supplementary methods in Additional file 3 is then given by:
where F is the set of fragments, X g is the number of fragments with alignments to locus g, I t (f) is the implied length of a fragment f assuming it originated from a transcript t (this is needed because only the ends of fragments are sequenced), D(t , f) is the probability of observing a fragment of length I t (f) at a known position in a transcript, and the term is the probability of selecting a fragment of a specific length within a given transcript, based on the bias weights at its 5' and 3' end points.
The bias weight b(t, i, j) factors as where i and j are the 5' and 3' endpoints, respectively, of a fragment mapped to transcript t. The and weights measure sequence-specific bias and are found by calculating the ratio of the probability of the sequence surrounding the fragment end under the biased model to the uniform (null) model. Note that we model both ends separately due to the differences in sequence selectivity between the priming steps during first- and second-strand synthesis. In our method, these probabilities are actually learned from the data using a variable length Markov model  to capture dependencies between positions in the sequence. Complete details are in the Supplementary methods in Additional file 3.
The and weights measure the 5' and 3' positional biases, respectively. In  it was shown that positional effects depend on transcript length, so we modeled positional effects using 100 = 20 × 5 parameters, with 5 sets of parameters for different transcript lengths (see Figure S2 of Additional file 1). For each range of transcript lengths, the length is divided into 20 windows, each with its own parameter that reflects the probability that the 5' or 3' end of a fragment lies there as opposed to elsewhere on the transcript. The ratio of these probabilities under the biased model to the uniform (null) model is represented by and , respectively.
The parameters that need to be estimated in the likelihood function are the abundances ρ and the bias parameters described above. We estimate the parameters using coordinate ascent. The model is linear in the ρ parameters for fixed bias parameters, and we maximize them as in . For fixed ρ, the bias parameters can be maximized as described in the Supplementary methods in Additional file 3. Therefore, we employ an iterative coordinate ascent procedure that, in effect, jointly maximizes all parameters. We found, however, that the gains in likelihood after the first iteration do not justify the time requirements, and we therefore limit all experiments to a single iteration. An initial ρ0 estimation with uniform bias weights seeds the maximization of the bias parameters. ρ is then maximized using these bias parameters, and is used as the final abundance estimate.
Cell culture/RNA prep and NanoString: Mouse embryonic stem cells (V6.5) were co-cultured with irradiated mouse embryonic broblasts as described in . mESCs were passaged once on gelatin alone before RNA extraction. Total RNA was extracted from mESCs using the protocol specified in the RNeasy kit (Qiagen). 100 ng of total RNA was hybridized for 17 hours with lincRNA codeset in technical triplicate. The hybridized material was loaded into the nCounter prep station followed by quantification on the nCounter Digital Analyzer as outlined by NanoString Technologies in their Total RNA Gene Expression nCounter protocol.
Mapping and annotation
To allow for consistent comparison across datasets, all read mapping was carried out using TopHat 1.1.0  with supplied annotations and the --no-novel-juncs option set, except for the SOLiD datasets, which were only available in a pre-aligned form with mapping by BioScope 1.2.1. All expression estimation and bias correction were done using Cufflinks 0.9.3 with the same annotation and reference sequence as TopHat. In the case of strand-specific libraries, the correct --library-type option was used as per the Cufflinks manual. For the mouse dataset in the NanoString experiment, the RefSeq refGene annotation for assembly NCBI37/mm9 was used, and was downloaded from the UCSC Genome Browser. For all human datasets, the RefSeq refGene annotation for assembly NCBI36/hg18  was used, and was downloaded from the UCSC Genome Browser. The only filtering was to remove non-chromosomal and 'random' contigs. After quanti cation with Cufflinks, the subset of transcripts with 1-to-1 mappings to the TaqMan qRT-PCR probes were selected (as listed in the supplement to ) to be used in the correlation tests. All yeast datasets used the Ensembl Saccharomyces cerevisiae annotation, release 59, which was downloaded from the Ensembl website . Mitochondrial, non-coding, and ribosomal RNA sites were masked in the annotation. All remaining transcripts were used in our correlation tests.
Comparison with previous methods
We downloaded the Genominator package version 1.4.0 using Bioconductor. We then followed the instructors provided with the Genominator for 'Working with the ShortRead Package'. We used the same annotations as in our Cufflinks experiment to define the ranges and transcript lengths for the RPKM calculations. We also used the same read mapping as was used for Cufflinks. Due to memory limitations of the software, we were forced to learn the weights separately for each chromosome.
We downloaded the mseq package version 1.1 from the Comprehensive R Archive Network (CRAN). Due to the specific mseq input format requirements, we remapped the reads using Bowtie version 0.12.5  with the --best option and default parameters otherwise. The mapping was then converted to the mseq input format using a custom Python script we wrote and that is provided in Additional file 4. We followed the instructions of  and trained the MART model with suggested parameters on the top 100 expressed transcripts, which we determined by computing the RPKM for every transcript. The UTR regions and an additional 100 bases on the ends of transcripts were excluded from the training. The surrounding sequence window was set to be 8 bases before and 12 bases after the first nucleotide in the read, which matches the window of our variable length Markov model and is where we observed bias for the dataset. The resulting sequence preferences were used to find the corrected RPKMs.
All correlation tests used least squares linear regression, as implemented in the R programming language. We found the P-value in Section 2.2 by sampling (with replacement) from the empirical distribution of fold changes 50,000 times for each transcript in order to generate 50,000 randomly adjusted sets of expression values. Of these, only 35 showed correlations better than the values that were corrected by our method (R2 = 0.81).
Comprehensive R archive network
Fragments per kilobase per million reads sequenced
Human brain reference
Microarray quality control
Not not so random (hexamer priming)
Not so random (hexamer priming)
Quantitative reverse transcription polymerase chain reaction
Random hexamer (priming)
Reads per kilobase per million reads sequenced
Short read archive
Universal human reference
We thank Joshua Levin and Mitchell Guttman for their help with the NanoString experiment. Anat Caspi was instrumental in helping us obtain the SOLiD data. Adam Roberts was supported by an NSF graduate research fellowship.
- Marguerat S, Bähler J: RNA-Seq: from technology to biology. Cellular and Molecular Life Sciences. 2010, 67: 569-579. 10.1007/s00018-009-0180-6.PubMedPubMed CentralView ArticleGoogle Scholar
- Jiang H, Wong W: Statistical inferences for isoform expression in RNA-Seq. Bioinformatics. 2009, 25: 1026-1032. 10.1093/bioinformatics/btp113.PubMedPubMed CentralView ArticleGoogle Scholar
- Li B, Ruotti V, Stewart R, Thomson J, Dewey C: RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010, 26: 493-500. 10.1093/bioinformatics/btp692.PubMedPubMed CentralView ArticleGoogle Scholar
- Nicolae M, Mangul S, Măndoiu I, Zelikovsky A: Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms in Bioinformatics. 2010, 6293: 202-214. full_text.View ArticleGoogle Scholar
- Paşaniuc B, Zaitlen N, Halperin E: Accurate estimation of expression levels of homologous genes in RNA-seq experiments. Research in Computational Molecular Biology. Edited by: Berger B. 2010, Berlin/Heidelberg: Springer, 397-409. [Lecture Notes in Computer Science, vol 6044.]View ArticleGoogle Scholar
- Trapnell C, Williams B, Pertea G, Mortazavi AGK, van Baren M, Salzberg S, Wold B, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010, 28: 511-515. 10.1038/nbt.1621.PubMedPubMed CentralView ArticleGoogle Scholar
- Bohnert R, Rätsch G: rQuant.web: a tool for RNA-Seq-based transcript quantitation. Nucleic Acids Research. 2010, 38: W348-W351. 10.1093/nar/gkq448.PubMedPubMed CentralView ArticleGoogle Scholar
- Hansen K, Brenner S, Dudoit S: Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research. 2010, 38: 1-7. 10.1093/nar/gkp1195.View ArticleGoogle Scholar
- Srivastava S, Chen L: A two-parameter generalized Poisson model to improve the analysis of RNA-seq data. Nucleic Acids Research. 2010, 38: e170-10.1093/nar/gkq670.PubMedPubMed CentralView ArticleGoogle Scholar
- Li J, Jiang H, Wong W: Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biology. 2010, 11: R50-10.1186/gb-2010-11-5-r50.PubMedPubMed CentralView ArticleGoogle Scholar
- Levin J, Adiconis X, Yassour M, Thompson D, Guttman M, Berger M, Fan L, Friedman N, Nusbaum C, Gnirke A, Regev A: Development and evaluation of RNA-Seq methods. Genome Biology. 2010, 11: P26-PubMed CentralView ArticleGoogle Scholar
- Kircher M, Kelso J: High-throughput DNA sequencing - concepts and limitations. BioEssays. 2010, 32: 524-536. 10.1002/bies.200900181.PubMedView ArticleGoogle Scholar
- Bradford J, Hey Y, Yates T, Li Y, Pepper S, Miller C: A comparison of massively parallel nucleotide sequencing with oligonucleotide microarrays for global transcription profiling. BMC Genomics. 2010, 11: 282-10.1186/1471-2164-11-282.PubMedPubMed CentralView ArticleGoogle Scholar
- Geiss G, Bumgarner R, Birditt B, Dahl T, Dowidar N, Dunaway D, Fell H, Ferree S, George R, Grogan T, James J, Maysuria M, Mitton J, Oliveri P, Osborn J, Peng T, Ratcliffe A, Webster P, Davidson E, Hood L, Dimitrov K: Direct multiplexed measurement of gene expression with color-coded probe pairs. Nature Biotechnology. 2008, 26: 317-325. 10.1038/nbt1385.PubMedView ArticleGoogle Scholar
- Fleige S, Pfaffl M: RNA integrity and the effect on the real time qRT-PCR performance. Molecular Aspects of Medicine. 2006, 27: 126-139. 10.1016/j.mam.2005.12.003.PubMedView ArticleGoogle Scholar
- Shi L, Reid L, Jones W, Shippy R, Warrington J, Baker S, Collins P, de Longueville F, Kawakasi E, Lee K, Luo Y, Sun Y, Willey J, Setterquist R, Fischer G, Tong W, Dragan Y, Dix D, Frueh F, Goodsaid F, Herman D, Jensen R, Johnson C, Lobenhofer E, Puri R, Schrf U, Thiery-Mieg J, Wang C, Wilson M, Wolber P, et al: The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology. 2006, 24: 1151-1161. 10.1038/nbt1239.PubMedView ArticleGoogle Scholar
- Au K, Jiang H, Lin L, Xing Y, Wong W: Detection of splice junctions from paired-end RNA-Seq data by SpliceMap. Nucleic Acids Research. 2010, 38: 4570-4578. 10.1093/nar/gkq211.PubMedPubMed CentralView ArticleGoogle Scholar
- Anders S, Hüber W: Differential expression analysis for sequence count data. Genome Biology. 2010, 11: R106-10.1186/gb-2010-11-10-r106.PubMedPubMed CentralView ArticleGoogle Scholar
- Bullard J, Purdom E, Hansen K, Dudoit S: Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010, 11: 94-10.1186/1471-2105-11-94.PubMedPubMed CentralView ArticleGoogle Scholar
- Wang E, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore S, Schroth G, Burge C: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456: 470-476. 10.1038/nature07509.PubMedPubMed CentralView ArticleGoogle Scholar
- Armour C, Castle J, Chen R, Babak T, Loerch P, Jackson S, Shah J, Dey J, Rohl C, Johnson J, Raymond C: Digital transcriptome profiling using selective hexamer priming for cDNA synthesis. Nature Methods. 2009, 6: 647-649. 10.1038/nmeth.1360.PubMedView ArticleGoogle Scholar
- Graveley B, Brooks A, Carlson J, Landolin J, Yang L, Artieri C, van Baren M, Boley N, Booth B, Brown J, Cherbas L, Davis C, Dobin A, Li R, Lin W, Malone J, Mattiuzzo N, Miller D, Sturgill D, Tuch B, Zaleski C, Zhang D, Blanchette M, Dudoit S, Eads B, Green R, Hammonds A, Jiang L, Kapranov P, Langton L, et al: The developmental transcriptome of Drosophila melanogaster. Nature. 20101, 471: 473-479.View ArticleGoogle Scholar
- Pickrell J, Marioni J, Pai A, Degner J, Engelhardt B, Nkadori E, Veyrieras J, Stephens M, Gilad Y, Pritchard J: Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010, 464: 768-772. 10.1038/nature08872.PubMedPubMed CentralView ArticleGoogle Scholar
- Cufflinks software. [http://bio.math.berkeley.edu/cufflinks/]
- Bühlmann P, Wyner A: Variable length Markov chains. The Annals of Statistics. 1999, 2: 480-513.Google Scholar
- Guttman M, Garber M, Levin J, Donaghey J, Robinson J, Adiconis X, Fan L, Koziol M, Gnirke A, Nusbaum C, Rinn J, Lander E, Regev A: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnology. 2010, 28: 503-510. 10.1038/nbt.1633.PubMedPubMed CentralView ArticleGoogle Scholar
- Short read archive. [http://www.ncbi.nlm.nih.gov/sra]
- SOLiD software and tools. [http://solidsoftwaretools.com/gf/project/wtpe/]
- Trapnell C, Pachter L, Salzberg S: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25: 1105-1111. 10.1093/bioinformatics/btp120.PubMedPubMed CentralView ArticleGoogle Scholar
- Pruitt K, Tatusova T, Klimke W, Maglott D: NCBI reference sequences:current status, policy and new initiatives. Nucleic Acids Research. 2008, 37: D32-D36. 10.1093/nar/gkn721.PubMedPubMed CentralView ArticleGoogle Scholar
- Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, Gordon L, Hendrix M, Hourlier T, Johnson N, Kähäri A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Larsson P, Longen I, McLaren W, Overduin B, Pritchard B, Riat HS, Rios D, Ritchie GRS, Ruffier M, Schuster M, et al: Ensembl 2011. Nucleic Acids Research. 2011, 39: D800-D806. 10.1093/nar/gkq1064.PubMedPubMed CentralView ArticleGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10: R25-10.1186/gb-2009-10-3-r25.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.