Adaptation of the early multiplexing RNA-seq library preparation workflow
First, we set out to benchmark SCRB-seq against the “gold standard” Illumina TruSeq workflow for bulk gene expression profiling. To do so, we prepared libraries following both protocols using RNA from GM12878 cells treated with either DMSO or IKK inhibitor (BAY 11-7082) to induce gene expression differences and thus to assess a potential difference between these two methods in the power to detect differentially expressed genes starting from the same RNA.
After sequencing, we first observed approximately 30% less SCRB-seq reads mapping to genes as compared to TruSeq (Fig. 1a), which implies that SCRB-seq libraries are more “contaminated” with undesired sequences (such as oligos, adapters, or polyA). This leads to a loss of approximately half of the initial sequenced reads, which may unnecessarily increase the sequencing need and thus overall cost. Interestingly, this effect was reproduced when aligning four publicly available bulk SCRB-seq datasets [14,15,16, 18] (Fig. 1a and Additional file 2: Table S1). Subsequently, we downsampled the respective libraries after alignment to consider an equal number of reads per replicate for both libraries (1M aligned reads, see the “Methods” section) and thus to allow a fair comparison between the SCRB-seq and TruSeq methods, thereby correcting for the discussed alignment issues. Upon investigating the complexity of the libraries (i.e., the number of detected genes), we found that at similar read depth (1M reads), SCRB-seq detected significantly less expressed genes than TruSeq (7% less genes across two conditions and three replicates, t test p value = 0.0038), thus revealing lower library complexity (Fig. 1b). We then performed an empirical power analysis between the two conditions of our LCL experiment (DMSO- or BAY 11-7082-treated LCL cells). We found that, with the same processed RNA, the SCRB-seq protocol uncovered ~ 20% less total differential expressed (DE) genes than the 1M downsampled TruSeq (Fig. 1c, 10 random downsampling). More importantly, the downsampled TruSeq was able to uncover ~ 35% more DE genes that were deemed “true positives” because these were uncovered using the full collection of 30M paired-end TruSeq reads. This points to a lower sensitivity of SCRB-seq libraries (less true positives/more false negatives). We concluded that in its original form, SCRB-seq is not competitive with TruSeq and that important workflow adaptations would be required to use this approach for bulk RNA sequencing.
Notably, we also noticed increased occurrences of “T” bases in the UMI sequence in the proximity of the dT stretch (Additional file 1: Figure S1c, left and center panels). We reasoned that since the stretch of 30 dT was not separated from the UMI sequence in the E3V6NEXT oligo-dT primer, oligonucleotides with longer dT had a higher affinity to the poly-A RNA tail, thus potentially affecting the diversity of the reads. This caused enhanced incorporation of primers containing UMIs and barcodes with higher dT, biasing the data. To overcome this issue, we designed novel BU3 primers so that the UMI and oligo-dT sequences were separated by five random non-T nucleotides (“V”), thus increasing the total UMI length to 15 nt (10 “N” + 5 “V”). This proved to be sufficient to reduce the overrepresentation of “T”-containing UMIs (Additional file 1: Figure S1c, right panel).
In addition, we anticipated that the efficiency of tagmentation might be increased by using Tn5 enzyme loaded with only i5 compatible adapters. Nextera Tn5 is a mix of transposases with two different adapter sequences (Tn5-A/B) intended to append either i5 or i7 Illumina indexes to generate compatible sequencing libraries. However, since the SCRB-seq libraries are amplified using only the i7 adapter (and a custom P5-TSO, bearing a P5 capture sequence), the cDNA fragments produced by introduction of the i5 compatible adapter sequence by Tn5 complex are not amplified by the limited-cycle PCR due to suppression PCR and are thus lost [21]. To reduce this loss, we used Tn5 enzymes that were produced in-house following the protocol of [22]. Indeed, we observed an increased library yield when in-house Tn5-B/B (loaded with only i7 compatible adapters) was used, compared to either Tn5 bearing both adapters, in-house made Tn5-A/B or the Nextera (Additional file 1: Figure S1d). Therefore, the use of in-house produced Tn5 helped to reduce the cost of library preparations. However, the impact of the Tn5 enzyme (A/B or B/B) on the sequencing data quality appeared to be relatively minor as confirmed by the downstream analysis (Additional file 1: Figure S2d), implying that one could still use Nextera Tn5 enzyme without loss of quality of the final data.
Second-strand synthesis without amplification improves data quality and biological relevance
Next, we performed a systematic evaluation of the key steps that might potentially affect the performance of SCRB-seq (Additional file 1: Figure S1b). To do so, we turned to a familiar model system that was also used in the original SCRB-seq paper [13]: adipocyte formation from human adipose stromal cells (hASCs), since a large number of genes show differential expression along this differentiation trajectory [23]. Specifically, we isolated total RNA from hASCs at two adipogenesis time points: t0 and t14 (non-differentiated ASCs and adipocytes, respectively) with two technical replicates each (Additional file 1: Figure S2a) after which we prepared cDNA libraries using our own set of improved barcoded primers (BU3).
We first tested different pre-amplification PCR cycle numbers (5, 10, and 15) as well as different input RNA amounts (1, 10, 100, 500, 1000, and 2000 ng), which may affect the overall amplification efficiency (Fig. 1d and Additional file 1: Figure S2b). To test the required combination of conditions, we prepared 18 libraries involving altogether 72 samples. This yielded two important insights: first, we detected an inverse correlation between the complexity/diversity of our RNA-seq libraries and the number of PCR cycles that were used to generate full-length double-stranded cDNA (Fig. 1d). Second, this effect was essentially independent of RNA input amount, although the highest performance in terms of uniquely mapped reads, percent duplication, mitochondrial read contamination, and the number of detected genes was generally observed between 10 and 100 ng of input RNA (Additional file 1: Figure S2b). Thus, five amplification cycles using 10–100 ng of input RNA appears preferred. We further found that this conclusion is independent of the RT enzyme used, since replacing Maxima Minus H (MMH) with SuperScript II (SSII) did not alter the number of detected genes using five amplification cycles and 100 ng of input RNA (Fig. 1e). Finally, our data revealed that the post-tagmentation library amplification step has a relatively minor impact on the downstream quality of the results as exemplified by solely 1–2% variation in read alignment rate and number of identified genes across the libraries amplified 8 to 12 PCR cycles (Additional file 1: Figure S2c).
The lowering data quality upon increasing the number of amplification cycles made us wonder whether PCR amplification in general is decreasing the quality of the output data. We therefore explored the value of using the Gubler-Hoffman procedure [24] to generate double-stranded cDNA instead of PCR amplification. While PCR amplification is easier to implement, the Gubler-Hoffman method bypasses the need for including a template switch oligo (TSO) in the first-strand synthesis, since the second-strand generation is driven by RNA primer-dependent nick translation by DNA polymerase I. Moreover, since we work with bulk RNA, samples may not require substantial amplification to enable subsequent tagmentation. In addition, for the remainder of the experiments, we used 100 ng of input RNA given the results discussed above and given that such an amount appears compatible with the majority of bulk RNA sequencing projects. As expected, we found that the yield of full-length cDNA generated with nick translation is lower compared to that obtained with PCR amplification and is dependent on the RT enzyme used (MMH or SSII) (Additional file 1: Figure S3a). Moreover, libraries that were generated with nick translation were more concentrated at the 3′-end of transcripts, an effect that was most visible when using SSII (Fig. 1f). The latter enzyme also yielded a lower rate of MT-rRNA reads compared to MMH (Additional file 1: Figure S3b). This is in line with the previously reported higher enzymatic activity of MMH compared to SSII [25], which may explain its lower specificity. Moreover, libraries prepared with nick translation involving the SSII enzyme had an increased ratio of reads mapping to annotated genes, namely ~ 76%, compared to ~ 65–70% produced with PCR amplification or when using the MMH enzyme (Additional file 1: Figure S3c). This was caused by a lower bias/noise resulting from the lower adapter and polyA contamination when preparing libraries using nick translation compared to pre-amplification (Additional file 1: Figure S3d). We concluded that second-strand synthesis via nick translation with SSII is preferable over the other combinations of second-strand synthesis/enzymes. These observations rationalize the novel Bulk RNA Barcoding and sequencing (BRB-seq) workflow, which features modified oligo-dT for cDNA barcoding and the second-strand synthesis involving DNA PolI Nick translation instead of PCR which accordingly enables the elimination of TSO for the first-strand synthesis (Fig. 2). The sequencing library is then prepared using cDNA tagmented by an in-house B/B Tn5 transposase and further enriched by limited-cycle PCR with Illumina compatible adapters.
BRB-seq outperforms SCRB-seq and its power is comparable to that of TruSeq
Next, we aimed at benchmarking our newly developed BRB-seq approach by comparing its output data to a reference “gold standard” dataset. To do so, we used again the Illumina TruSeq Stranded mRNA protocol and applied it on the same hASC RNA samples (Additional file 1: Figure S2a). First, we observed a high correlation between log2 transformed read count values of technical BRB-seq replicates (Pearson’s r = 0.98) (Fig. 3a) and similarly with TruSeq (r = 0.92) (Fig. 3b). The ratio of reads mapping to annotated genes was slightly lower than that of TruSeq (~ 76% vs. ~ 84%, Fig. 3c), but on average 22% higher than what was previously observed when using the original SCRB-seq protocol (Fig. 1a). The BRB-seq libraries showed high read diversity, allowing the detection of a comparable number of genes as TruSeq at the same sequencing depth (Fig. 3d). Importantly, we confirmed the high accuracy of DE gene detection of BRB-seq validated by the high number of DE genes overlapping with TruSeq (Fig. 3e). The latter detected only 7% more DE genes than BRB-seq, compared to 35% more than SCRB-seq (Fig. 1c). BRB-seq’s efficacy was further confirmed by increased fold change (t0 vs t4) correlation, as well as PR AUC and ROC AUC values (Additional file 1: Figure S4a, taking the full TruSeq ~ 30M paired-end run as “gold standard”). Importantly, we found that the ability to detect DE genes is inherently linked to the absolute gene expression levels and both TruSeq and BRB-seq exhibited very similar detection thresholds (Fig. 3f). We, therefore, concluded that a greater sequencing depth (> 5M reads) would in this case only be effective for BRB-seq or TruSeq libraries when specifically looking for DE genes with low to very low expression levels (i.e., CPM < < 1) (Fig. 3g).
We further investigated whether DE genes that were discovered with the two approaches were biologically relevant. For this, we conducted a functional enrichment analysis of the DE genes that were upregulated in the differentiated hASC cells using adipocyte-related gene sets from KEGG [38], Gene Ontology (GO) [37], and Gene Atlas databases. Overall, both BRB-seq and TruSeq DE genes were strongly enriched in adipocyte gene sets (Additional file 1: Figure S4b). It is also worth noting that the “Adipocyte” gene set (from Gene Atlas database) was slightly more enriched with BRB-seq as compared to TruSeq at a similar sequencing depth.
After having empirically validated the capacity of BRB-seq on real data, we aimed at evaluating its ability to uncover DE genes based on simulated data, where the DE genes are a priori known. To this end, we performed a power simulation using the powsimR package [26]. We thereby included, for the sake of comprehensiveness, not only our in-house generated data (SCRB-seq LCL, BRB-seq hASC, and TruSeq hASC) but also the published SCRB-seq datasets mentioned above [14,15,16, 18] since the DE genes are simulated. We performed the simulation using 5, 20, and 50 replicates downsampled at 1M reads (see the “Methods” section). The results of this analysis proved concordant with our empirical power analysis, showing again that BRB-seq was able to uncover DE genes at a level comparable with TruSeq (t test p value n.s.), while significantly higher than that of SCRB-seq (t test p < 0.05 for all three studies), and the effect is maintained for different numbers of replicates (Fig. 3h).
Given the performance of BRB-seq, combined with the fact that it is time- and cost-efficient, we envisioned that it could potentially become an alternative to RT-qPCR assays, especially when large sets of samples need to be profiled. To confirm that BRB-seq libraries can produce reliable gene expression results, we compared it to RT-qPCR data. We evaluated nine genes that are expressed at different levels in adipocytes. We performed two RT-qPCR replicates, one with 50 ng of RNA and the other with 500 ng using again the same RNA sample as was used to prepare the first-strand reactions for BRB-seq and TruSeq libraries (Additional file 1: Figure S2a). After normalization to HPRT1 expression, we assessed the correlation of expression values between each of the methods (Fig. 3i). We observed that both BRB-seq and TruSeq highly correlate with qPCR (Pearson’s r = 0.8–0.9) with BRB-seq slightly outperforming TruSeq. This effect was observed for both qPCR replicates.
Taken together, these results confirm the high overall performance of the BRB-seq approach, which yields a comparable efficiency/sensitivity as TruSeq, but at a fraction of its cost (see the “Discussion” section).
Multiplexing capacity of BRB-seq
So far, our experiments involved just a couple of samples. To assess whether BRB-seq’s performance would be maintained in a multiplexing context, we prepared an additional BRB-seq library containing 60 human lymphoblastoid cell line (LCL) samples, which have been routinely used in large-scale projects including the 1000 Genome Project. We focused on these cell lines since corresponding Illumina TruSeq data had been generated at two separate occasions, thus enabling a direct, comprehensive comparison between the two approaches. Specifically, we used two datasets: “TruSeq A” is from [27] involving all 60 samples that were profiled with BRB-seq and “TruSeq B” from [28] containing 53 of the 60 samples (Additional file 2: Table S2). Of note, the libraries of both TruSeq datasets were prepared using TruSeq RNA Sample Prep Kit v2, which does not preserve strand-specific information, contrary to the BRB-seq and TruSeq mRNA Stranded protocols that were used before. However, given that only poly-A+ transcripts are profiled, we assume that differences in DE power between these TruSeq protocols are rather minor.
Our analyses showed that BRB-seq libraries identified over 14k protein-coding genes across the 60 samples (i.e., detected in at least one sample). The fraction of genes detected within all three datasets (Fig. 4a, yellow sector) represented over 97% of BRB-seq genes and 84–87% of the genes discovered by TruSeq. Importantly, this overlapping population contained all highly expressed genes (CPM > 100), all but 54 medium-expressed genes (1 < CPM < 100, Fig. 4b, blue population), and over 2600 lowly expressed genes (CPM < 1, Fig. 4b, yellow population). Thus, the genes that remained undetected by BRB-seq (1687 genes, Fig. 4a and Fig. 4b, blue population) contained predominantly lowly expressed genes (n = 1637, CPM < 1) and no highly expressed genes (CPM > 100). This likely reflects the fact that BRB-seq was initially sequenced to a lower level (6M single-end reads per sample on average) compared to TruSeq (13.6M and 29.7M paired-end reads for TruSeq A and B, respectively). Even prior to downsampling to 1M reads, therefore, some lowly expressed genes may not have been sequenced enough to aggregate at least one read in the BRB-seq dataset and thus may also not be detectable upon downsampling. Similarly, most genes that were uniquely identified within each dataset, including by BRB-seq, tend to be lowly expressed (CPM < 1) (Fig. 4b).
We further found an overall high correlation between BRB-seq and TruSeq A and B log2 read count values (Pearson’s r = 0.89 and 0.89, Fig. 4c), performed for each replicate sample across protocols. Finally, across the samples, the overall correlation was above 0.8 and only slightly lower compared to what was found for the two TruSeq datasets (Fig. 4d).
Taken together, these results show that BRB-seq constitutes a highly affordable (see the “Discussion” section), robust high-throughput 3′-end transcriptomics approach that produces data featuring a quality that is comparable to that of the “gold standard” TruSeq methods.
BRB-seq performs well on low-quality RNA samples
It is well established that the TruSeq Stranded mRNA method performs poorly on degraded RNA samples given the intrinsic requirement of this method to have an RNA quality number (equal to RIN, RNA integrity number) ≥ 7–8. This may reflect the fact that full-length transcripts are sequenced, thus requiring high-quality, intact RNA for accurate detection and quantification. Since 3′ RNA fragment quantification is known to be a robust way to estimate differential gene expression in samples with low RNA quality numbers (RQNs) [29], we decided to evaluate the performance of BRB-seq on fragmented RNA samples with low RQN values. For this, we employed chemical RNA fragmentation by incubation at 65 °C in the presence of Mg++ cations for 1 or 2 min, which resulted in a significant reduction in overall RNA size and RQN values (Additional file 1: Figure S5).
As expected, we observed a clear inverse correlation between the quality of the samples and their RQN values, but of minor effect size. Indeed, the correlation between fragmented and non-fragmented samples remained above 97%, even for samples with very low RQN (Fig. 5a). Detection of DE genes in the degraded versus intact samples was more substantially affected by prolonged fragmentation and observed by lowered fold change correlation, PR AUC, and number of detected DE genes (Fig. 5b). Nevertheless, we could still detect more than 75% of true DE genes in the samples with RQN values as low as 2.2, which is generally considered as a mark of very highly degraded RNA (Fig. 5b). Together, these data show that BRB-seq allows reliable differential gene expression and functional enrichment analyses, even on low-quality/degraded RNA samples.
BRB-seq data analysis pipeline and considerations
Upon the sequencing of the BRB-seq libraries, highly multiplexed datasets are produced which may pose analytical problems, specifically for users with limited bioinformatic skills. To make the entire workflow of the method accessible to the scientific community at large, we aimed at streamlining the analysis of the sequenced data. For this, we developed a complete tool suite (http://github.com/DeplanckeLab/BRB-seqTools), supporting all the required post-sequencing tasks up until the generation of the read/UMI count matrix (Fig. 6a and detailed in Additional file 3: Supp. Method).
Thereafter, the data can be processed with conventional R scripts/packages to perform the required analyses or even Excel for direct visualization. Alternatively, the count matrix file can be supplied to ASAP (https://asap.epfl.ch/), a web-based platform devoted to comprehensive/automated transcriptome analyses developed in our lab [30]. Consequently, along with the protocol itself, we provide a seamless pre- and post-treatment pipeline for enabling any user to perform a state-of-the-art analysis of their BRB-seq data.