Estimating fragment bias in existing protocols
Fragment counts in an RNA-Seq experiment are determined by two distinct phenomena: fragments originating from highly expressed transcripts appear more often in the data than those originating from less expressed transcripts, and library preparation introduces biases that may preferentially select some potential fragments over others. By fragment bias we mean only the over- or under-representation of fragments due to sequence-specific or positional bias, as discussed in the Background. Because expression levels also affect fragment abundances, transcript abundances and bias parameters must be estimated jointly in order to properly learn the bias directly from RNA-Seq data.
Figure 2 illustrates this issue: the need for joint estimation of bias parameters and expression values is evident from a comparison of the raw counts of bases at the starts/ends of fragments (panel a) with the counts normalized by transcript abundances (panel b). The latter calculation depends on the bias parameters, so joint estimation is required. We extended the likelihood framework described in [6] to perform this parameter estimation (see Materials and methods), resulting in 'learned' bias weights (Figure 2, panel d) that we used to adjust expected fragment counts when computing abundances with our likelihood model. Figure 3 shows an example of how well these bias estimates capture the over- and under-representation of reads at different positions of a transcript, based on its sequence.
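To make the circularity concrete, the following toy sketch (our illustration, not the Cufflinks implementation) alternates between the two estimates. It assumes single-end, uniquely mapped reads and a bias that depends only on the fragment's first nucleotide; the real model conditions on much longer sequence windows and on position.

```python
def joint_bias_abundance(frag_starts, transcripts, n_iters=10):
    """Toy joint estimator. frag_starts maps transcript id to a list of
    observed fragment start positions; transcripts maps transcript id
    to its sequence string."""
    bases = "ACGT"
    weights = dict.fromkeys(bases, 1.0)         # start-base bias weights
    abundance = dict.fromkeys(transcripts, 1.0)

    for _ in range(n_iters):
        # Abundance is proportional to the bias-adjusted count: observed
        # fragments divided by the expected (bias-weighted) yield.
        for t, seq in transcripts.items():
            expected = sum(weights.get(b, 1.0) for b in seq)
            abundance[t] = len(frag_starts[t]) / expected

        # Re-learn weights as observed over expected start-base frequencies,
        # with the expectation normalized by abundance. Without this
        # normalization, highly expressed transcripts dominate and their
        # sequence content masquerades as "bias" (Figure 2a versus 2b).
        obs = dict.fromkeys(bases, 1e-9)
        exp = dict.fromkeys(bases, 1e-9)
        for t, seq in transcripts.items():
            for pos in frag_starts[t]:
                if seq[pos] in obs:
                    obs[seq[pos]] += 1.0
            for b in seq:
                if b in exp:
                    exp[b] += abundance[t]
        weights = {b: obs[b] / exp[b] for b in bases}
        norm = sum(weights.values()) / len(bases)   # keep weights mean 1
        weights = {b: w / norm for b, w in weights.items()}
    return weights, abundance
```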
Validation by comparison to alternative expression assays
We emphasize that our goal was not to validate RNA-Seq per se, but rather to show that bias correction improves expression estimation. Therefore, in interpreting the correlations throughout the paper, we focused on the improvement in correlation with bias correction rather than on its absolute value. In this regard, we report most of our results as the fraction of discrepancy explained, calculated by dividing the change in R2 after bias correction by the difference between the initial R2 and 1 (a perfect correlation). Selected correlation plots can be found in Figure S3 of Additional file 1 and all raw expression data in Additional file 2. We also observed that correlation results were sensitive to the extent of filtering of low-abundance fragments, and we therefore attempted to eliminate filtering in the experiments we performed (see Materials and methods for more detail).
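As a concrete illustration of this statistic (a minimal sketch; the function name is ours):

```python
def fraction_discrepancy_explained(r2_before, r2_after):
    # Fraction of the remaining gap to a perfect correlation (R2 = 1)
    # that is closed by bias correction.
    return (r2_after - r2_before) / (1.0 - r2_before)

# Using the MAQC HBR values reported in the next section:
print(fraction_discrepancy_explained(0.753, 0.807))  # ~0.22, that is ~22%
```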
A major problem with validating RNA-Seq expression estimates is that there is no clear 'gold standard' for expression estimation. Comparison of RNA-Seq to microarrays has suggested that the former technology is more accurate than the latter [13]. We examined the recently published NanoString nCounter gene expression system [14], but noticed many unexplained outliers and high variance between technical replicates (see Figure S4 of Additional file 1 and data in Additional file 2). Quantitative reverse transcription PCR (qRT-PCR) has served as a benchmark in numerous studies, but it is not a perfect expression measurement assay [15], and it is therefore a priori unclear which technology currently produces the most accurate expression estimates. Nevertheless, we believe qRT-PCR to be the best available measure of expression aside from, perhaps, RNA-Seq itself. Given the previously demonstrated superiority of RNA-Seq over microarrays and the problems with NanoString, we performed all our benchmarking with respect to qRT-PCR.
We began by comparing expression estimates on the Microarray Quality Control (MAQC) Human Brain Reference (HBR) dataset, which includes 907 transcripts with uniquely mapping TaqMan qRT-PCR probes [16], against RNA-Seq data from the same sample sequenced by Illumina (SRA012427) [17] (Figure 4). We examined the correlation of the Cufflinks output with the qRT-PCR expression data and observed an increase in R2 from 0.753 before correction to 0.807 after correction.
We examined the basis for the change in correlation by further investigating, for each transcript, whether its expression estimate increased or decreased after bias correction, and by how much. The arrows in Figure 4 show the direction and extent of expression change with correction, and the overall fold-change distribution. Many transcripts show large changes in expression, with a median absolute fold change of 1.5 (Figure 4b). To establish the significance of the improvement in correlation, we performed a permutation test in which we randomly perturbed the expression estimates of transcripts according to the fold-change distribution in Figure 4b. We obtained a P-value of 0.0007, indicating that the improvement in R2 achieved by our correction is highly significant. Together, these results show that bias correction may dramatically affect expression estimates via both increases and decreases of expression values, and that these changes provide an overall improvement in abundance estimates.
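In outline, the permutation test works as follows (a hedged sketch; the exact procedure is described in Materials and methods, and all names here are ours). It assumes expression values on a log scale, so that fold changes act additively:

```python
import numpy as np

def permutation_pvalue(qpcr, before, after, log_fcs, n_perm=10000, seed=0):
    """qpcr, before, after: log-scale expression vectors of equal length.
    log_fcs: empirical log fold changes induced by the real correction."""
    rng = np.random.default_rng(seed)

    def r2(x, y):
        return np.corrcoef(x, y)[0, 1] ** 2

    real_gain = r2(qpcr, after) - r2(qpcr, before)
    hits = 0
    for _ in range(n_perm):
        # Apply randomly drawn fold changes as a sham "correction".
        sham = before + rng.choice(log_fcs, size=len(before), replace=True)
        if r2(qpcr, sham) - r2(qpcr, before) >= real_gain:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothed P-value
```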
Comparison with previous methods
In [8], a method for bias correction is proposed that corrects read counts for transcripts according to the bias learned for patterns at the start of reads (normalized using sequences in the interior of reads). This approach uses less information than our method, as it is restricted to learning bias within the read sequence and cannot capture bias in the sequence surrounding the fragment start site. Furthermore, count-based methods do not fully exploit the information available in paired-end reads, which allow for the determination of fragment length. Fragment length can help in assigning ambiguously mapped fragments to transcripts, and our method takes advantage of this. On the other hand, since read counts have been promoted as an acceptable way to measure abundance [18], we compared this method to ours using the MAQC qRT-PCR data from the previous section. Figure 5 shows the results of the method of [8], both before and after bias correction (R2 = 0.711 before and R2 = 0.715 after correction). To obtain these results we used the software package Genominator [8], following the guidelines in the documentation, with the exception that bias was learned separately for each chromosome, as the software was not able to load an entire genome into memory. More details are provided in the Materials and methods section.
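In outline, the count-based weighting works roughly as follows (our paraphrase of the idea in [8], not the Genominator code; the k-mer length and interior offset are placeholders):

```python
from collections import Counter

def read_start_weights(reads, k=7, interior_offset=24):
    """Compare k-mer frequencies at read starts against k-mers drawn
    from the read interior, which serve as an approximately unbiased
    background; reads is an iterable of read sequences."""
    start_kmers, interior_kmers = Counter(), Counter()
    for r in reads:
        if len(r) >= interior_offset + k:
            start_kmers[r[:k]] += 1
            interior_kmers[r[interior_offset:interior_offset + k]] += 1
    n_start = sum(start_kmers.values())
    n_interior = sum(interior_kmers.values())
    # A read is then down-weighted by the enrichment of its start k-mer.
    return {kmer: (start_kmers[kmer] / n_start) /
                  (interior_kmers[kmer] / n_interior)
            for kmer in start_kmers if kmer in interior_kmers}
```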
We also compared our approach to the mseq method in [10]. We again used the MAQC HBR qRT-PCR data, this time preparing the sequences and learning model parameters following the suggested guidelines in [10]; that is, we trained the parameters of a MART model for bias on the 100 most highly expressed transcripts in the experiment, and then tested on the set of 907 transcripts with uniquely mapping TaqMan probes. In this case, we observed an uncorrected R2 = 0.730 and a corrected R2 = 0.755. Note that even though expression was again calculated using counts, the initial correlation of mseq is better than that of Genominator because the implementation in [10] required us to remap the reads directly to the transcript sequences, which is presumably more accurate than relying on spliced mapping.
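For reference, model fitting along these lines can be sketched as follows (our reconstruction of the setup, not the mseq code; the feature encoding and hyperparameters are placeholders):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(window):
    # One-hot encode the sequence window surrounding a position.
    x = np.zeros(4 * len(window))
    for i, b in enumerate(window):
        if b in BASE_INDEX:
            x[4 * i + BASE_INDEX[b]] = 1.0
    return x

def train_mart(windows, counts):
    """Fit boosted regression trees (a MART) predicting per-position
    read counts from local sequence, trained only on windows drawn
    from the most highly expressed transcripts."""
    X = np.array([one_hot(w) for w in windows])
    y = np.log1p(np.asarray(counts, dtype=float))  # stabilize count variance
    model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
    model.fit(X, y)
    return model
```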
We suspect that the overall inferior results of both Genominator and mseq in comparison to Cufflinks are due in part to the fact that the bias parameters cannot be learned from raw read counts, but must be normalized by the expression values of the transcripts from which the reads originate (Figure 2). For example, in [10], bias parameters are learned from what are estimated to be the most highly expressed transcripts based on RPKM, but these are likely to also be the most positively biased transcripts, and are therefore not representative in terms of their sequence content. We also believe that, as we argued in [6], it is important to account for fragment lengths in estimating expression, and read-count-based expression measures do not use such information. Another issue affecting Genominator is that instead of computing the expected read count, as is done in Cufflinks and mseq, the observed read counts are adjusted. This means that at positions lacking read alignments, there is no correction of bias. We believe this may partially explain the improved performance of mseq in comparison to Genominator.
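The last distinction can be made concrete with a schematic contrast (ours, not either tool's actual computation) for a single transcript with per-position read counts and bias weights:

```python
def adjust_observed(counts, weights):
    # Observed-count style: each observed read is reweighted, so a
    # position with zero reads contributes nothing and its bias is
    # never corrected.
    return sum(c / w for c, w in zip(counts, weights))

def adjust_expected(counts, weights):
    # Expected-count style: the total count is compared against the
    # bias-weighted yield summed over *all* positions, so read-free
    # positions still inform the correction.
    return sum(counts) / (sum(weights) / len(weights))
```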
Technical replicates
A recurring concern with RNA-Seq has been that repeated experiments, possibly based on different libraries or performed in different laboratories, may be variable due to experimental 'noise'. We investigated these effects, starting with the correlation between technical replicates before and after bias correction. We define technical replicates to be the sequencing of two different libraries that have been prepared using the same protocol from a single sample. This differs slightly from some previous uses; in particular, technical replication has also referred to two sequencing experiments from the same library. Such replicates have already been shown to exhibit very little variability [18, 19].
We postulated that the differences between expression estimates from two different libraries should be reduced after bias correction. We tested this hypothesis in a series of analyses whose results are shown in Figure 6. First, we examined libraries prepared in two different experiments from the same MAQC Universal Human Reference (UHR) sample. In the first experiment [20], which we will refer to by its accession SRA008403, the sample was sequenced from one library preparation. In the second experiment [19], which we will refer to as SRA010153, the sample was sequenced in four separate library preparations. Although the same protocol was used in all five replicates, the learned bias weights differ somewhat between the data produced by the two labs (see Figure S2 in Additional file 1).
Figure 6 shows how the correlations of the replicates with qRT-PCR and with each other were affected by bias correction. Although our method does improve the pairwise correlations between different library preparations within SRA010153, the initial correlation is already so high (average R2 > 0.96) that we show only the average pairwise correlations against qRT-PCR and the SRA008403 dataset. The greater correlation among the SRA010153 replicates, compared with their correlation with SRA008403, further indicates that bias is more similar when the protocol is carried out by the same lab, presumably by the same person. Bias correction clearly removes much of the difference in quantification between the replicates introduced by sequence-specific and positional bias. Furthermore, as in the initial validation example, the correction brings both sets closer in line with the qRT-PCR standard.
Library preparation methods
In Figure 7 we demonstrate our ability to correct bias specific to libraries prepared using different protocols. For this experiment, we applied our method to several libraries from a study comparing strand-specific protocols (SRA020818) using the same yeast sample [11], as well as to a dataset generated using the 'not so random' (NSR) priming protocol on the human MAQC HBR sample [21]. We compared all of these datasets with a standard Random Hexamer (RH) control for the given sample. Note that although the control (RH) and dUTP libraries have the same sequence bias (see Figure S2 in Additional file 1) and near-perfect initial correlation (R2 > 0.99), the remaining discrepancy is reduced by positional bias correction.
Because the NSR dataset was sequenced from the MAQC HBR sample, we were also able to compare it to the qRT-PCR standard. We found that our method explained 33.5% of the discrepancy between an initial estimation and qRT-PCR.
Sequencing platforms
Previous studies on bias in RNA-Seq have focused on experiments performed with Illumina sequencers. To investigate whether bias persists with other library preparation and sequencing technologies, we examined bias in a SOLiD experiment that sequenced both MAQC samples using the standard whole transcriptome (WT) protocol. We saw clear signs of both sequence-specific and positional bias that differed from the other protocols we had examined (see Figure S2 of Additional file 1).
We next compared the expression estimates for the SOLiD dataset with those from an Illumina dataset (accession SRA012427), before and after bias correction. To show that the improvement in correlation does not come solely from correcting bias in the Illumina dataset, we measured the improvement from correcting one dataset at a time, compared with simultaneous correction of both platforms. We found an increase in R2 from 0.74 to 0.88 (Illumina correction) and 0.85 (SOLiD correction), compared with 0.94 when both were corrected. These results are summarized in Figure 8. While one cannot draw general conclusions from a single experiment, we note that our approach to quantifying bias should be useful in future studies that aim to quantitatively compare bias among different sequencing platforms.