We have used data from the SEQC Sequencing study to investigate the effects of read-length on RNA-seq results and then validated the results using data from the ENCODE consortium. Since our main goal was to investigate the role of read length in determining RNA-seq results, we wanted to minimize all other variables. Therefore, we obtained the same sets of physical reads for the entire experiment and these reads were bioinformatically trimmed to produce reads of shorter lengths. This trimming is akin to what would have been obtained if the sequencing machine had been stopped earlier than it was for the longer reads. The quality and error profile of the 50th base of a 50 bp read is the same as that of the 50th base of a 100 bp read.
We took two samples from the Association of Molecular Resource Facilities (ABRF) SEQC study that consisted of RNA standards (see "Materials and methods"), labeled here as A and B. For each sample, we took three sets of paired-end 101 bp reads to form the basis of the analysis. These reads were then processed to produce 100, 75, 50 and 25 bp paired-end reads and 100, 75, 50 and 25 bp single-end reads. All of the reads were then aligned to the human genome using the STAR aligner.
Mapping statistics and splice junction detection
Mapping statistics were first examined (Fig. 1a), and the data showed overall consistent mapping statistics. All alignments of 25 bp read lengths contained a low number of uniquely mapped reads. This deficiency was partially improved when using 50 bp read lengths, while 75 bp and 100 bp reads contained almost the same number of uniquely mapped reads. The number of multi-mapped reads increased when using 25 bp reads, whereas all other read lengths were consistent. However, the number of multi-mapped reads increased significantly when using single-end reads (Fig. 1).
The number of splice junctions detected for each alignment was also determined (Fig. 1b). Alignments with 25 bp read lengths resulted in a significantly lower number of splice junctions detected (t = −13.13, p value = 2.2 × 10−16). Moreover, the numbers of splice junctions detected by the alignment of 100 bp reads was significantly higher than with any other read lengths (t = 7.08, p value = 9.5 × 10−8). Also, single-end reads detected fewer splice junctions overall when compared with the paired-end reads. In summary, longer paired-end reads are significantly better for splice junction detection.
Differential expression at different read lengths
The aligned reads were processed through one of three computational pipelines to determine differential expression (DESeq, EdgeR and Cufflinks). The log2 fold-change data from these pipelines were used to determine DEGs and we extracted the top 200 DEGs from each pipeline for comparison. There was a high overall degree of consistency between the three pipelines (Fig. 2) for the sets of genes that were determined to be significantly differentially expressed for any specific read length.
The top 200 DEGs for all read lengths for each differential expression method and each sample were sorted both by log2 based fold-change (Log2FC) of both up-regulated (+Log2FC) and down-regulated genes (−Log2FC) and by p value and compared. The number of orphan (read-length-specific) genes was then calculated. For single-end samples (Fig. 2a), 25 bp read lengths gave the highest percentage (average of 13.8 %) of orphan genes among all differential expression methods and samples. This shows that the differential expression profile calculated using the 25 bp read length is significantly different. Using paired-end 25 bp reads reduced this difference (Fig. 2b; down to an average of 5 %); however, among paired-end reads, 25 bp reads still gave the highest difference. The differences between 50 bp, 75 bp and 100 bp, for both paired-end and single-end reads, were small (0–12 % orphan genes).
The number of read lengths that support any specific DEG was calculated using the same overlap as above. For single-end reads (Fig. 2c), the percentage of DEGs supported by all four read lengths is fairly low. This percentage is significantly higher (t = −4.85, p value = 0.00015) when using paired-end reads (Fig. 2d), which suggests that paired-end reads are better for determining DEGs. Also, sorting on p value instead of Log2FC improves the overlap when using single-end reads and significantly improves the overlap when using paired-end reads.
For a true-positive comparison of gene expression, we used previously reported quantitative PCR (qPCR) results comparing the expression of samples A and B (Fig. 3). We calculated the Pearson correlation and root mean square deviation (RMSD) between all the DEGs for each read length and the qPCR results. While no read length results were completely concordant with the qPCR results, the lowest level of identity was achieved with the single-end and paired-end 25 bp reads and the quality did not significantly improve beyond that achieved with 50 bp single-end reads (Fig. 3a, b).
We next looked at the lists of the top 200 DEGs identified for each read length to determine their overlap (Fig. 3c, d). These lists represent the main result of an RNA-seq experiment performed using a particular read length. Overlap between the lists of the top 200 DEGs sorted by Log2FC shows that single-end 25 bp reads perform the worst; however, this performance is nicely recovered when using paired-end 25 bp reads (average of 121.5 versus 137 genes, respectively). Most 100 bp reads performed similarly when comparing paired-end versus single-end reads. Therefore, the gain in performance is not significant when using reads >50 bp for paired-end reads and >75 bp for single-end reads.
Additionally, we looked at genes that are paralogous (n = 21,459) to investigate whether longer read lengths result in higher mapping of overall reads to these genes (Figure S6a in Additional file 1). We calculated the total number of reads mapped to these genes in the paired-end SEQC dataset of samples A and B. To our surprise, with the exception of 25 bp reads, all the read lengths resulted in equal numbers of reads aligned to the paralogous genes. The overall trend of number of reads mapped as a function of read length stays the same regardless of paralogous gene length (Figure S6b in Additional file 1).
Splicing detection
Besides DEG detection, another function of RNA-seq is to determine splice sites and RNA isoforms. Using our variable length reads, we determined the number of splice sites that were detected using the STAR algorithm. We found that, for the detection of both known splice sites (Fig. 4a) and novel splice sites (Fig. 4b), there was a marked improvement with longer read lengths, and with the use of paired-end reads relative to single-end reads. The reason for this improvement is presumably because longer reads are generally superior for mapping and they have a greater chance of overlapping a splice junction, which is a major component of splicing detection. Based on the results, the longest reads (≥100 bp) should be used if junction detection is the primary goal of a sequencing experiment. This holds true for both known and novel junctions.
The percentage of splice junctions that were detected with all the reads lengths was also determined (Fig. 5). Overall, the percentage of common splice junctions detected with all four read lengths is very small (Fig. 5a). All four read lengths detect only about 8–9 % of known splice junctions and about 1–2 % of novel splice junctions. However, since 25 bp read lengths contribute to many false positives, the overlap was also analyzed after discarding results found with the 25 bp read length (Fig. 5b). The percentage of common splice junctions increased to over 50 % for known splice junctions and over 20 % for novel junctions. This also shows that the 25 bp read length does not detect many splice junctions that are detected by reads lengths greater than 25 bp.
To further confirm this result, we examined well-validated, “golden junctions”, which were found by all five sequencing platforms from the ABRF study on RNA-seq [3] (Figure S1 in Additional file 1). These data showed that 25 bp reads showed the lowest percentage overlap with golden junctions (9.6 %). However, read lengths longer than 25 bp all showed high overlap (97 %) and there was relatively little difference between read lengths. Therefore, read lengths of 50 bp or longer can still be used for detecting bona fide junctions.
Confirmation of results on ENCODE samples
In order to confirm our findings on an additional data set, we performed an identical analysis on RNA-seq data from NHDF and IMR90 cells obtained from an ENCODE study (Figures S3, S4, and S5 in Additional file 1). The only difference was that there were only two replicates available for each ENCODE sample whereas there were three replicates available for the SEQC samples. Additionally, there was no qPCR data available for the ENCODE samples. The results for SEQC samples were mainly replicated when using the ENCODE samples, with a consistent trend of little improvement over short reads’ utility for DEG detection but longer reads consistently improving isoform detection.
Confirmation of results using RSEM/EBseq method
We also aligned all the SEQC samples using RSEM [4] and then performed differential expression using EBseq [5]. The differential expression tables were then analyzed to find the number of orphan reads and percentage overlap similar to in Fig. 2 (Figure S2 in Additional file 1). RSEM/EBseq showed higher overlap and less orphan genes between read lengths. However, the overall trend is still similar and thus the conclusions made using the STAR aligner still hold true.