Single-molecule nascent RNA sequencing captures RNA intermediates during termination
Our group recently developed a single-molecule nascent RNA sequencing method named FLEP-seq (Full-Length Elongating and Polyadenylated RNA sequencing) [38] by using the chromatin-bound nascent RNA to study the coordination between Pol II elongation and splicing [37]. Here, we demonstrate that the comprehensive profile of full-length nascent RNAs in Arabidopsis captured by FLEP-seq can be applied to study transcription termination by characterizing the various forms of RNA intermediate produced during the termination process (Fig. 1a). Compared to previous short-read Illumina-based methods, the long-read Nanopore sequencing of FLEP-seq enables us to distinguish the full-length readthrough RNAs from the 5′ and 3′ cleaved transient RNAs (Fig. 1a), which are generated by the 3′-end processing factors at the poly(A) site [7, 43]. FLEP-seq also captures the nascent polyadenylated mRNAs that are still associated with chromatin, thus allow us to obtain the precise location of poly(A) site for each gene [30, 44] (Additional file 1, Fig. S1a-c). Given the heterogeneity of poly(A) sites at Arabidopsis 3′ UTRs [19, 45], having both the poly(A) site information and the cleavage information in the same library can facilitate our analysis by enabling a more accurate assessment of cleavage events.
The advantage of having the unfractionated long RNAs from FLEP-seq data enabled us to distinguish and categorize the RNA intermediates into four groups: readthrough transcripts, 5′ cleavage products, 3′ cleavage products, and poly(A) transcripts (Fig. 1a; Additional file 1, Fig. S2). Take DRM2 (AT2G33830) as an example (Fig. 1b, upper panel), readthrough transcripts are those that extended downstream of the poly(A) site; 5′ cleavage products are those cleaved at the poly(A) site but not yet polyadenylated; 3′ cleavage products are full-length transcripts with their 5′ ends located downstream of the poly(A) site; and poly(A) transcripts are RNAs with poly(A) tails, and this result is also used to define the position of poly(A) site. Previous short-read based nascent RNA sequencing methods, such as pNET-seq, plaNET-seq, and GRO-seq have been applied in Arabidopsis [39,40,41,42] and in other plants including cassava and maize [46], are mostly developed for detecting Pol II-associated elongating RNAs, and can also detect RNA signal downstream of poly(A) site (Fig. 1b, lower panel). In particular, the pNET-seq method can detect nascent RNAs that are associated with various isoforms of Pol II CTD, including total CTD, unphosphorylated, phosphorylation on Ser 2 and Ser 5 [41], and thus may distinguish Pol II at certain transcription step. Compared to these methods, our FLEP-seq method captures a much more comprehensive and smoother pattern of termination and has the unique advantage of distinguishing whether the RNA intermediates are cleaved or not (Fig. 1b). Moreover, in the cases where two neighboring genes are immediately adjacent to each other, short-read-based methods would have difficulty assigning the reads near gene boundaries, while Nanopore long reads from adjacent genes can be easily separated (Fig. 1e).
Genome-wide analysis showed that the poly(A) transcripts are the most abundant forms of the four, followed by readthrough transcripts and 5′ cleavage products (Fig. 1c, d). The 3′ cleavage products only occupied a low proportion (1%) among the RNA intermediates, consistent with the rapid degradation of uncapped transcripts by exonuclease AtXRN3 in Arabidopsis [14, 47, 48]. FLEP-seq monitors both co-transcriptional cleavage and readthrough, whereas other methods do not. Therefore, by analyzing the 3′ end of the non-poly(A) RNAs, our results showed a clear pattern for termination: with a sharp peak enriched at the poly(A) site followed by a gradual decline (Fig. 1f). Taken together, the full-length nascent RNA sequencing method is well suited for tracking various transient RNA products during transcription termination.
The termination landscape of Arabidopsis genes
Transcription termination of Pol II is thought to be a random process that occurs at different distances downstream of the poly(A) site [18, 49]. We refer to the distance in which Pol II traveled past the poly(A) site before released from DNA as the previously described “termination window (TW)” [15, 17], also known as “termination zone” [18, 50]. To date, the genome-wide measurement of termination window has only been accessed by short-read sequencing, including in human cells with TT-seq method [17], in yeast with 4tU-seq method [15], and in Arabidopsis with plaNET-seq method [42]. The TT-seq study found a wide range of termination windows among ~ 7000 genes in human cells, with a median width of ~ 3300 bp and can go up to over 10 kb [17]. The 4tU-seq study found that the median termination window is 163 bp in yeast [15]. The plaNET-seq study suggested a median readthrough distance of ~ 500 bp for Arabidopsis genes [42]. Compared to these methods, our FLEP-seq method can capture and distinguish the various forms of full-length terminating RNAs with unprecedented depth and resolution, thus presents an opportunity to examine termination patterns for individual genes in detail. For each nascent RNA molecule, readthrough can be calculated as the distance between its 3′ end of the RNA and the poly(A) site of the gene (Fig. 2a). Because the detection of the longest readthrough RNA at any loci will be correlated with sequencing depth, using the longest readthrough distance to represent the termination window will make the estimation sensitive to fluctuations of sequencing depth (Fig. 1d, Fig. 2b, c). Our single-molecule long-read method can track each RNA molecule separately, therefore enabling us to use the median readthrough distance of each gene to represent its termination window size as a more robust measurement (Fig. 2b, d). We observed consistent TW size (Pearson’s R = 0.91) for individual genes between the two biological replicates of FLEP-seq libraries (Fig. 2d). Hence, we use the median readthrough distance to represent the TW size in the following analyses. We found the lengths of TWs vary drastically among the 9830 Arabidopsis genes analyzed (each with a minimal of 15 readthrough reads), ranging from ~ 50 nt to over 1000 nt (Fig. 2e), with the median size at ~ 160 nt (Additional file 2, Table S1). For example, the termination of gene AT3G51730 occurred quickly downstream of the poly(A) site with a TW size of 85 nt (Fig. 2f), while gene AT1G62820 (CML14) has many longer readthrough transcripts with a TW size of 686 nt (Fig. 2g). Even using the longest readthrough to represent TW, the median of which is 521 nt, the TW in Arabidopsis is still considerably shorter than the TW in human reported by TT-seq [17]. The much shorter termination windows of Arabidopsis compared to human may be due to the compact genome size of Arabidopsis, which is ~ 20 times smaller, and hence has a much denser gene arrangement. Taken together, our data reveals a comprehensive termination landscape of Arabidopsis.
Pol II co-localizes with termination factors at the end of termination window
The transition from transcription elongation to termination requires the slow down or pause of the elongating Pol II [51]. The pausing of Pol II at the 3′ end of the gene is usually associated with Pol II carboxy-terminal domain (CTD) serine 2 phosphorylation (Ser2P) [2, 41]. Previous studies revealed that the Pol II accumulation downstream of poly(A) site is related to the 3′ pausing [18, 50, 52]. Compared to previously published ChIP-seq data [53], we observed that ends of the termination window precisely reside with the peaks of Pol II and Ser2P, further suggesting that using median instead of the longest readthrough distance to represent TW is a more appropriate measurement (Fig. 3a). The presence of termination window engaged Pol II peak could also be observed in pNET-seq [41] and GRO-seq [40] data (Fig. 3b; Additional file 1, Fig. S3a).
During the termination process, many factors are associated with the Pol II elongation complex [1]. We found that BDR1, a negative elongation factor that prevents transcriptional interference in Arabidopsis [54], localized precisely at the end of the termination window (Fig. 3c). Similarly, an mRNA 3′-end processing factor FPA that controls the cleavage and polyadenylation [12, 55, 56] is found to coincide with BDR1 distribution (Fig. 3c). In addition, ATAC-seq data [57] showed a preference for open chromatin status at the end of termination window (Additional file 1, Fig. S3b), consistent with the previous report that nucleosome depletion is linked to the 3′-end formation [58]. While poly(A) sites are featured with low CG methylation and low GC content compared to flanking regions, we did not observe obvious change of DNA methylation or GC contents at major site for transcription termination (TWE) (Additional file 1, Fig. S3c, d). We also found that the strength of Pol II occupancy is negatively correlated with TW size (Fig. 3d), and similar trends were observed in the distribution of BDR1 and FPA. Previous study demonstrated that BDR proteins prefer to reside at gene borders, in both transcription start sites (TSS) and transcription end site (TES) [54]. It is worth to clarify that poly(A) site, TES, and TWE are different from each other—poly(A) site is where cleavage and polyadenylation occurs; TES in Araport11 is defined based on the reconstructed transcript assembly using a collection of Illumina RNA-seq [19], and TWE is the median readthrough length defined by FLEP-seq data (Additional file 1, Fig. S1d). Our results further revealed that most of FPA peaks overlapped with BDR1 peaks at termination window (Fig. 3e). Nevertheless, only a small portion of termination windows contained BDR1 and FPA peaks at their ends (Fig. 3e), suggesting many other factors are potentially involved in the termination process in Arabidopsis.
tRNA genes promote efficient termination of Pol II transcription
The tRNA genes are transcribed by Pol III [59]. In this study, we discovered a termination mechanism for Pol II that is shaped on the downstream tRNA genes. In the compact Arabidopsis genome, we identified in total more than 60 Pol II genes that have adjacent tRNA genes immediately downstream (distance to downstream gene < 200 nt) (Additional file 3, Table S2). In the tandemly arrange case, the readthrough transcripts terminated sharply at 60 nt upstream of the tRNA gene, and the distance between termination window end and downstream tRNA is consistently at ~ 60 nt (Fig. 4a, b). Our results showed that this highly efficient pattern of termination is specific to genes to which a tRNA gene locates downstream (Fig. 4c; Additional file 1, Fig S4). For the 34 genes whose termination is regulated by the downstream tRNA in tandem direction, 26 of them have termination window ends located ~ 60 nt upstream of the tRNAs. While in the convergently arrange cases, the distance between the termination window end and tRNA is around 10 nt with a less obvious concentration (Fig. 4d, e), and these genes also terminated more efficiently than those with non-tRNA immediately downstream (Fig. 4f). Previous report estimated that RNA polymerase may contact up to 90 bp of DNA (− 70 to + 20) at promoter regions [60]. Moreover, previous literature has shown that Pol II pausing or stalling can be influenced by chromatin structure [2]. Therefore, it is possible that the termination of Pol II elongation at − 60 nt is caused by a unique chromatin status at the tRNA gene promoter. To test this scenario, we analyzed published data of MNase-seq [61], ChIP-seq (H3K4me3, H3K36me3, H3K27me3, H3K9me2) [62], and the CG, CHG, and CHH DNA methylation level to check the chromatin status at the two tRNA loci showed in Fig. 4, including nucleosome positioning, histone modification, and DNA methylation (Additional file 1, Fig. S5). We did not observe an obvious enrichment or depletion of the common epigenetic marks around the tRNA promoters. Future studies on the factors that bind to the promoter of tRNA gene may help to explain its role in blocking Pol II transcription. Besides Arabidopsis, a previous report in Leishmania major also found tRNA gene can serve as terminator for Pol II transcription in the trypanosomatida [63].
We also analyzed the correlation between termination window size and intergenic distance, direction, and transcription level of downstream protein-coding and/or noncoding genes, respectively. The genome-wide analysis showed that the termination window size of genes arranged in tandem is similar to genes arranged in convergent direction (Additional file 1, Fig. S6a). We found that termination window size is slightly positively correlated with intergenic length and slightly negatively correlated with the transcription level of downstream genes (Additional file 1, Fig. S6b-d). We proposed that the pairing between Pol II gene and their downstream tRNA may be evolutionarily beneficial for assisting efficient termination in a small and compact genome (Fig. 4 g).
Mutation of AtXRN3 delays the termination of cleaved readthrough transcription
Many previous studies have supported the torpedo model for transcription termination, in which a nuclear 5′ → 3′ exonuclease plays the central role in degrading the 3′ cleavage products, and eventually catch up with elongating Pol II to expel it from DNA template via kinetic competition [2, 16, 18]. In Arabidopsis, two nuclear exonucleases AtXRN2 and AtXRN3 are orthologs of the human Xrn2 [11, 48]. Although they share extensive sequence similarities, only AtXRN3 is shown to be the primary exonuclease involved in Pol II termination [14, 47, 64]. RNA-seq analysis performed with total RNAs from weak alleles of atxrn3 show increased signal of mostly polyA+ mRNAs downstream of poly(A) site compared to wildtype [14, 47]. However, it remains unclear if the 3′ cleavage products without poly(A) tails accumulated in atxrn3 and how atxrn3 affects readthrough, due to the limitation of short-read sequencing. We set out to address these questions by taking advantage of our full-length nascent RNA sequencing method, which can capture the 3′ cleavage products, particularly those without poly(A) tails, enriched in the partial loss-of-function allele of atxrn3. In order to maximize the capture rate of 3′ cleavage products which might not be closely associated with chromatin, we used nuclear RNA instead of chromatin-bound RNA as the nascent RNA input of FLEP-seq. It turned out that both the nuclear fraction and the chromatin-bound fraction can efficiently capture the cleaved and readthrough transcripts (Additional file 1, Fig. S7a). Since nuclear RNA isolation involves fewer steps than chromatin-bound RNA isolation, we proceeded with nuclear RNA to make FLEP-seq libraries in a series of mutants and wildtype controls (please see “Methods” for detail). The termination window sizes from wildtypes using chromatin-bound RNA and nuclear RNA are highly consistent (Additional file 1, Fig. S7b), suggesting that nuclear RNA is a viable substitute for chromatin-bound RNA in studying transcription termination in Arabidopsis.
Compared to wildtype, the 3′ cleavage products drastically accumulated in the atxrn3 mutant (Fig. 5b), consistent with the function of AtXRN3 in 5′ → 3′ degradation of co-transcriptional cleavage products [9, 11, 14]. This accumulation leads to a clear peak of the 5′ end of cleaved readthrough reads (3′ cleavage products) at poly(A) site in atxrn3 mutant, which is absent in the wildtype control library (Fig. 5a). Similar results were also observed in human Xrn2 depletion cell line detected by short-read-based method POINT-5 [25]. In addition, accumulation of the 3′ cleavage products is not influenced by splicing, as genes with or without splicing have 3′ cleavage products enriched in atxrn3 and aligned accurately at the poly(A) site (Fig. 5b, Fig. S8). From a genome-wide perspective, we compared the size of termination windows in the atxrn3 mutant and in wildtype, and the result showed a strong impact of atxrn3 on termination window size at hundreds of gene loci (Fig. 5d). The termination window of 354 genes were statistically longer in the atxrn3 mutant than in wildtype (Mann–Whitney U test, p value < 0.001). These results illustrated that AtXRN3 is specifically responsible for the degradation of 3′ cleavage products in vivo.
In addition to atxrn3, we also characterized the fpa and met1 mutants by FLEP-seq. FPA is a component of the 3′ end processing complex, and MET1 is the key DNA methyltransferase in Arabidopsis [65]. The fpa mutant FLEP-seq data showed a global impact on termination window size and a prolonged 3′ end distribution compared to WT (Fig. 5c, d), consistent with its previously reported function as a termination factor [12, 56]. We also found some interesting cases that support the previous report of chimeric transcripts and cryptic splicing occurring in the fpa mutant [12, 56]. For example, some readthrough transcripts in the fpa mutant can extend into downstream genes to form chimeric RNAs accompanied by the cryptic splicing event (Additional file 1, Fig. S9a). Cryptic splicing event in the fpa mutant can even result in excision of the entire gene in the middle from the readthrough transcript that spans multiple genes (Additional file 1, Fig. S9b). In the met1-1 mutant of DNA methyltransferase MET1 [65], in which most CG methylation at genic regions is lost (Additional file 1, Fig. S13a), the distribution for 3′ ends of readthrough RNAs is largely unaffected (Fig. 5c), and the size of termination windows is similar to that in wildtype Col-0 (Fig. 5d). However, it is worth noting that met1-1 is not a null allele and still have some remaining CG methylation at TE regions (Additional file 1, Fig. S13b), recent characterization using ONT direct RNA sequencing of full-length mRNA from the strong met1-3 allele, which removes virtually all CG methylation, has discovered the effects of DNA methylation on splicing site and poly(A) site selection, as well as on poly(A) tail length [66].
Furthermore, we identified 14 genes with cleaved readthrough transcripts entering their immediate downstream genes in atxrn3 mutant (Additional file 4, Table S3). For example, AT1G73510 is a pollen-specific gene that is not expressed in seedlings (Additional file 1, Fig. S10), the materials used in our FLEP-seq libraries. In atxrn3, readthrough from its upstream gene NUDT21 (AT1G73540) and ORRM6 (AT1G73530) can continue elongation and pass through the entire downstream gene, and then be cleaved and polyadenylated (Fig. 6a). Strikingly, our full-length data revealed that the readthroughs in arxtn3 can be cleaved and polyadenylated multiple times as they elongate through several subsequent poly(A) sites in a row (Fig. 6a, magnified view). In addition, we found that cleaved readthrough transcription can yield normally spliced and polyadenylated mRNAs without their own transcription initiation (Fig. 6a). This can be clearly seen with a zoomed-in view around the TSS site of AT1G73530 in two biological replicates of the wildtype and atxrn3 FLEP-seq libraries, showing that some polyadenylated transcripts of AT1G73530 in atxrn3 originated from the upstream poly(A) site as cleavage products of upstream transcriptional readthrough, instead of their own initiation (Fig. 6b). A reordered view of reads at AT1G76180-AT1G76170 region also confirms this observation, with most reads of the downstream AT1G76170 come from the cleaved readthrough of the upstream gene AT1G76180 (Additional file 1, Fig. S11). Besides nascent RNA, we also check the mRNA level by analyzing previously published RNA-seq data of wide-type and atxrn3 mutant [14]. The coverage plot confirmed that more reads are aligned to the intergenic region of AT1G73540-AT1G73530 in atxrn3 mutant, compared to the fewer reads at the same regions in wildtype (Fig. 6c). It remains unclear if these mRNAs originated from 3′ cleavage products can be translated, as they may lack the 5′ cap structure. Previous work on AtXRN3 proposed several models to explain the elevated poly(A) + RNA-seq signal downstream of poly(A) site, including a role for transcription activation of downstream genes by readthrough transcription [14, 47]. Our single-molecule nascent RNA data suggests that readthrough transcription itself could be enough to drive the production of multiple downstream transcripts, highlighting the importance of AtXRN3-mediated transcription termination in the compact Arabidopsis genome (Fig. 6d).
However, the AT1G73530-AT1G73510 fusion poly(A) transcripts cannot be solely explained by the loss of 5′ to 3′ exonuclease activity as AtXRN3 should not affect the cleavage at the poly(A) site. It is possible that AtXRN3 may affect chromatin status that in turn determines poly(A) site selection and the readthrough phenomena. To check if DNA methylation status is altered at poly(A) sites in the atxrn3 mutant, we performed whole-genome bisulfite sequencing (WGBS) in seedlings of atxrn3 and corresponding Col-0 control. At the AT1G73530-AT1G73510 loci, there is little CG methylation in the wildtype, and it remains mostly unmethylated in the atxrn3 mutant (Additional file 1, Fig. S12a). This is consistent with previous DNA methylation profiling of multiple wildtype Arabidopsis libraries from the Jacobsen group [67] (Additional file 1, Fig. S12b). In addition, we found that the overall DNA methylation pattern at genic regions remains unchanged in the atxrn3 mutant compared to WT in CG, CHG, and CHH contexts (Additional file 1, Fig. S13c). However, individual poly(A) sites could still be affected, as a recent work has discovered large number of novel TTSs in the met1-3 mutant, suggesting a role of DNA methylation in poly(A) site selection [66]. We proposed that, besides exonuclease activity of AtXRN3, other factors such as chromatin status and DNA modification could also contribute to the accumulated readthrough transcripts in the atxrn3 mutant.