 Method
 Open access
 Published:
Synergising singlecell resolution and 4sU labelling boosts inference of transcriptional bursting
Genome Biology volume 24, Article number: 138 (2023)
Abstract
Despite the recent rise of RNAseq datasets combining singlecell (sc) resolution with 4thiouridine (4sU) labelling, analytical methods exploiting their power to dissect transcriptional bursting are lacking. Here, we present a mathematical model and Bayesian inference implementation to facilitate genomewide joint parameter estimation and confidence quantification (R package: burstMCMC). We demonstrate that, unlike conventional scRNAseq, 4sU scRNAseq resolves temporal parameters and furthermore boosts inference of dimensionless parameters via a synergy between singlecell resolution and 4sU labelling. We apply our method to published 4sU scRNAseq data and linked with ChIPseq data, we uncover previously obscured associations between different parameters and histone modifications.
Background
The canonical understanding of transcription is that it consists of the steps of initiation, elongation and termination. During initiation of transcription in eukaryotes, RNA polymerase (RNAP) is recruited to the promoter via transcription factors (TF), followed by the synthesis of the first few bases of the new transcript [1]. Elongation succeeds initiation, in which RNAP processes along the gene, incorporating RNA nucleotides into the nascent transcript as it progresses [2]. Upon reaching the transcription end site (TES), termination occurs, in which the transcript and RNAP are released from the DNA [3]. Various processing steps take place at different points during transcription to allow for a mature transcript to be produced, including 5′ capping during initiation, splicing to remove intronic (noncoding) sequences during elongation of proteincoding genes, and polyadenylation and cleavage during termination [1,2,3,4].
Beyond the general mechanism outlined above, transcription is also a stochastic process subject to intrinsic noise through its fundamental dependence on probabilistic collisions between molecules [5, 6], which are often present in relatively low numbers. Additionally, in many cases, transcription occurs only in short, intense bursts of activity followed by prolonged periods of inactivity, resulting in increased cellcell variability in transcript counts [7, 8]. Indeed, studies have identified a broad spectrum of genes, from those that are transcribed in a Poissonian fashion, such as housekeeping genes, to those which are very bursty in nature and expressed only in relatively short windows [9, 10]. The transcriptional noise and cellcell variance induced by bursting can be utilised to, for example, achieve alternative cell fates during differentiation of cell populations without requiring explicit control by genetic programming or external signals [11]. There are several different possible mechanisms thought to contribute to bursting, including the process of reinitiation, in which after transcribing a gene the RNAP is immediately recycled to the transcription start site (TSS) instead of simply terminating and disengaging [12]. This requires looping of the gene to bring the TSS and TES into physical proximity [13], and the link between TSSTES interactions and bursting has been explored recently [14]. The chromatin state of a gene also plays an important role in governing transcriptional bursting dynamics, which in eukaryotes is dictated largely by histone modifications (HM). Different HMs may result in looser or tighter packing of the chromatin, respectively, with the chromatin density around the TSS being correlated with transcriptional noise [15]. Having active HMs at the TSS results in an increased probability of open chromatin, which facilitates initiation. This is proposed to reduce burstiness, possibly by reducing the duration between active periods [15, 16]. More recent studies have also reported genomewide direct correlations between the presence of specific HMs at gene promoters and general transcriptional noise [17, 18], while further studies have even linked HMs with the underlying bursting dynamics, both at the individual gene level [19] and genomewide [20]. Transcriptional bursting in bacteria can also result from supercoiling of the DNA [21]. The proposed mechanism is the accumulation of positive supercoiling caused by the RNAP proceeding through the gene, until it reduces the rate of elongation to the point that it prevents further transcription. Intermittent clearing of supercoiling followed by rapid transcription, and subsequent reaccumulation of supercoiling, results in bursty transcription. Studies have also observed the cocondensation of TFs with transcriptional coactivators such as p300, which mediates cooperative activation of genes by clusters of TFs [22]. This cooperative activation results in nonlinear gene regulation and increased burst frequency and burst size for genes enriched in coactivators.
Transcriptional bursting may be understood in terms of several parameters (Fig. 1a), including the burst size (transcripts produced per burst, b), burst frequency (bursts per unit time, \(\kappa\)), decay rate (transcripts degraded per unit time, \(\delta\)), transcript lifetime (average transcript survival time, \(\gamma =1/\delta\)), burst rate (bursts per transcript lifetime, \(a=\kappa /\delta\)), and expression level (mean transcripts per cell, \(\mu =b\times a\)). Many studies make use of fluorescence microscopybased approaches to interrogate transcriptional bursting dynamics. Single molecule fluorescence in situ hybridisation (smFISH) is a particularly popular approach here although the standard procedure offers only a snapshot of transcript counts across a cell population, with no timevariant information. Therefore, the timescales of bursting events may not be discerned [23], allowing estimation of \(\mu\), b and a but not \(\kappa\) or \(\delta\). Some smFISHbased experimental setups have progressed towards a level of understanding bursting timescales by using hybridisation specific to nascent transcripts [24, 25], although smFISH approaches generally suffer from scalability. While progress is being made towards multiplexing, it can still only analyse a handful of genes at a time compared with sequencing [26,27,28] or requires complex and labourious setups [29]. Sophisticated analysis methods [30] have been developed for timelapse singlecell RNA imaging data [31] which allows dissection of transcriptional dynamics in great detail, however such approaches are even more limited scalewise.
Single cell RNAseq (scRNAseq) experiments are widely used to analyse genomewide bursting dynamics. However, scRNAseq suffers from the same issue as standard smFISH regarding analysis of bursting timescales because it only provides a snapshot of the transcriptomes of a population of cells at a single point in time. Therefore, it has only been possible to obtain burst sizes (b) and burst rates (a), while burst frequencies (\(\kappa\)) may not be understood without making assumptions or using prior information on decay rates (\(\delta\)) measured through separate experiments [10, 32,33,34]. On the other hand, bulk RNAseqbased approaches have for several years made use of chemically labelled nucleotides, primarily 4thiouridine (4sU) as in SLAMseq, to understand RNA synthesis (\(b\times \kappa\)) and degradation (\(\delta\)) rates [35, 36]. The cells are incubated in the presence of 4sU for a given duration, prior to RNA extraction. During this step, 4sU diffuses into the cell nucleus and becomes incorporated into nascently transcribed RNA. Labelled RNA can be bioinformatically distinguished from nonlabelled RNA, previously residing in the cell, due to the higher rate of chemically induced cytosine conversion of 4sU relative to regular uracil. Using mathematical modelling, the ratio of labelled to unlabelled transcripts can be used to estimate the turnover rate [37]. However, since bulk RNAseq neglects the cellcell variability, it can not be used to study bursting dynamics. Recent advances combine scRNAseq with 4sU and such datasets have the potential to fully characterise transcriptional bursting dynamics and their timescales (Fig. 1b). Thus far, they have been used for understanding dynamic changes in the transcriptome and/or RNA turnover/splicing rates that occur throughout the cell cycle and cell state transitions [38,39,40,41,42]. Studies with data of this type that have looked at bursting have only done so in a limited manner, using empirically derived statistics as a proxy for burstiness [43], while bursting timescales have remained uncharacterised in recent works [44]. This is despite previous modelling works having shown that degradation is expected to contribute significantly to transcriptional noise and therefore should be accounted for when investigating bursting dynamics [45].
Here, we construct mathematical models to relate observables from 4sU scRNAseq data to the underlying bursting dynamics and develop an adaptive Markov chain Monte Carlo (MCMC) approach for Bayesian inference of the parameters governing those dynamics. We have produced an R package (https://github.com/hebenstreitLab/burstMCMC) from our method and applied this to published data from [38], demonstrating that we are able to characterise timeresolved transcriptional bursting dynamics for hundreds of genes in parallel. Our approach generates joint probability distributions of the parameters of interest from which estimates can be extracted and confidence in these quantified. This is the first method for joint inference of timeresolved bursting dynamics on a genomewide scale and is generally applicable to 4sU scRNAseq datasets. We also show that, even for the dimensionless parameters which can be obtained with conventional scRNAseq, the accuracy and reliability of estimates can be improved by incorporating the additional information provided by 4sU scRNAseq. Finally, we build on a previous study which interrogated correlations between bursting parameter estimates and HMs in a genomewide manner, linking scRNAseq with ChIPseq data [20]. Our analysis reveals positiondependent associations between different parameters and HMs only apparent with 4sU scRNAseq.
Results
Model comparison
We tested the advantages provided by 4sU scRNAseq data coupled with our inference approach over conventional scRNAseq by comparing our recovery of known bursting parameter values from a simulated dataset using different likelihood functions (Methods). The MCMC algorithm was run five times, using Eqs. 4, 15, 16, 19 and 20 as the likelihood functions, referred to as L1, L2, L1+L2, L3 and L1+L3, respectively.

L1: The likelihood function of model 1, equivalent to scRNAseq data without 4sU, relying solely on the UMI counts.

L2: Equivalent to relying only on single cell T>C conversions, without fully incorporating the UMI counts.

L1+L2: The likelihood function of model 2, equivalent to 4sU scRNAseq data, incorporating all of the available information together.

L3: Equivalent to bulk SLAMseq data without spikeins, ignoring UMI counts and using only cellsummed T>C conversions.

L1+L3: The likelihood function of model 3, equivalent to combining bulk SLAMseq data without spikeins and scRNAseq data.
Convergence to the target distribution is shown in Fig. 2 for each likelihood function, confirming that scRNAseq data cannot resolve \(\kappa\) or \(\delta\), but does converge for the other parameters, while L2 and L1+L2 converge for all parameters, confirming that 4sU scRNAseq data can timeresolve bursting. Unlike L2, L3 is unable to converge for any parameters other than \(\delta\), further demonstrating the advantage of cellspecific vs cellsummed T>C conversion data. Conversely, L1+L3 does converge for all parameters, with L1 informing burstiness while L3 informs timescales.
The resulting posteriors (Fig. 3) indicate that the accuracy and precision of estimates for a, b and \(\mu\) are improved by incorporating the singlecell 4sU conversion data compared to relying solely on scRNAseq or scRNAseq with bulk SLAMseq data, which is because the cellcell variance in the T>C rate is a function of the transcriptional noise (burstiness) of the gene as well as turnover and, therefore, including such information makes the estimation more robust. Likewise, we see that while conventional scRNAseq may not resolve \(\kappa\) or \(\delta\), including the UMI count information with the conversion data also results in more precise and accurate estimates of these parameters. This is because the set of T>C conversions is a function of a, b and \(\delta\), while the UMI counts are a function of a and b. Therefore, including the UMI data improves inference of a and b, which reduces the error associated with \(\delta\) in our joint inference approach.
Overall, we see that L1+L2 outperforms all other likelihood functions for all parameters including L1+L3, demonstrating the benefits that a fully integrated analysis of timeresolved bursting dynamics using 4sU scRNAseq data provides over more limited, separate treatments of subsets of the parameters by combining scRNAseq (a and b) and bulk SLAMseq (\(\delta\)) information. This is apparent in this example of a gene with moderate expression, high transcriptional noise and a transcript lifetime similar to the 4sU pulse duration.
Inference on data from Qiu
We next applied our method to 4sU scRNAseq data published in 2020 by Qiu et al, which used human K562 cells [38]. Inference on the data from Qiu was carried out for all genes with at least one read and observed T>C conversion in both the 4sU and control datasets, running the MCMC algorithm in parallel on each to obtain a posterior from model 2 or model 3 if required (Methods). The final set of genes to be analysed was selected based on those with sufficient confidence in all parameter estimates. Therefore, a maximum CV value of 0.45 was imposed for all parameter estimates, so that only genes with no CV \(>0.45\) would be included, leaving 584 genes as the final selected set.
For the selected genes we observe that the quality of our estimates depends upon the location of the gene within parameter space, as shown in Fig. 4, which depicts estimate vs CV for all parameters. \(CV(\delta )\) has an optimal (minimum) value for \(\delta\) corresponding to an average transcript lifetime equal to the 4sU pulse duration (4 hours), with confidence decreasing bidirectionally and outliers with very low \(CV(\delta )\) corresponding to genes with \(\mu \ge 1000\). We also have increased confidence in general for genes with higher \(\mu\) since estimates for such genes are informed by a greater volume of data. Likewise, genes with greater b have greater confidence because, firstly, increased b results in higher \(\mu\). Secondly, for a given \(\mu\), having a higher b implies lower a, meaning that the transcriptional noise is higher, resulting in a more heavily skewed transcript count distribution (across cells) which may be more precisely attributed to a region of parameter space. We do not see a visually obvious trend in confidence for a. This is because it is associated with higher expression level but lower transcriptional noise. Therefore, a gene with higher a has more data points with which to inform the estimate but a less skewed transcript count distribution, so that the effects on confidence tend to cancel each other out. The trend in confidence for \(\kappa\) is essentially dictated by the a and \(\delta\) values for the gene.
Instead of relying solely on model 2, for some genes we must switch to an alternative (model 3). This occurs when genes lie within a region of parameter space such that the solution to Eq. 9 becomes unstable. Figure 4 provides evidence supporting the reliability of our inference approach, since the model 2 and 3 genes generally occupy the same regions of the plot and exhibit the same relationships between confidence and estimate for each parameter. This also illustrates the increased probability for a gene to reside within unstable parameter space, and therefore require use of model 3, when \(\mu\) and a are higher and when \(\delta\) is lower.
We reinforce our results by demonstrating a strong positive correlation about the diagonal between our estimates of \(\delta\) and cellmatched values calculated in [36] for the same genes (Additional file 1: Fig. S3). Further assessment of our parameter estimation and confidence quantification was provided by carrying out inference on simulated data. This simulationbased validation differs from the previously described model comparison analysis (Figs. 2 and 3) in that experimental settings, such as cell number, cell capture efficiency and sequencing depth, were equivalent to those in the Qiu dataset rather than being idealised, and the bursting parameter values estimated for each of the 12276 genes we analysed were used as the true values for a corresponding simulated gene. Strong, tight correlations about the diagonal between estimates and true parameter values confirmed the capacity for the algorithm to recover known parameter values (Additional file 1: Fig. S4).
Now that we have estimates for all parameters of interest, it is possible to demonstrate how the different aspects of the data feed into informing the joint probability distribution. Figure 5a illustrates some expected correlations, showing that \(\mu\) correlates very strongly with the mean UMI count and that \(\delta\) correlates very strongly with the 4sU  control T>C rate, since these values reflect the overall activity and turnover of the gene, respectively. We see that a correlates strongly against the CV of the UMI count, which reflects the relationship between bursting and cellcell variability. It is also possible to demonstrate the aforementioned complex relationship between burstiness and the shape of the singlecell T>C count data, but not in a genomewide manner since the effect is masked by variation in \(\mu\) and \(\delta\). Therefore, we instead compare a pair of genes (ATF5 and CAP1) with very similar estimates for \(\mu\) and \(\delta\) but very different values of a (and therefore also b and \(\kappa\)), with ATF5 being expressed in a far more bursty fashion. The estimates for the different parameters of these genes are given by Table 1.
The density plot in Fig. 5a compares the distribution of cellspecific T>C rates (minus genespecific background) across all reads in the cell for the aforementioned pair of genes. There is a clear difference in the shape of the distribution, with the bursty gene having a greater density at either extreme while the gene with less noisy expression has a greater intermediate density. This is because large, infrequent bursting has a binarising effect, meaning that most cells either have a low or high T>C rate. Those with a low rate correspond to those which have had no bursts occur during the 4sU pulse, resulting in their entire transcript population comprising those surviving from before the pulse. Those with a high rate correspond to those which have had at least one burst occur during the pulse. Since the bursts tend to be large, this results in the majority of the transcript pool being comprised of newly synthesised transcripts. On the other hand, smaller, more frequent bursts causes the surviving transcripts to gradually become replaced by new transcripts in a more uniform manner across cells. Similarly to how scRNAseq reveals differences in cellcell variation in transcript counts for two genes with otherwise equal expression levels, 4sU scRNAseq also reveals differences in cellcell variation in new transcript proportions for two genes with otherwise equal decay rates.
Despite controlling for \(\mu\) and \(\delta\) in this pairwise comparison of a high vs low noise gene, the effect of bursting on cellspecific T>C rates shown in Fig. 5a is still somewhat obscured by the variable cellspecific capture efficiencies, \(\alpha\), present in the data. Therefore, datasets were simulated in the same manner as for the model comparison analysis, except \(\lambda _s=0.001\), and \(\alpha =1\) to totally control for the effect of capture efficiencies. Datasets were simulated for a gene with high noise and another with low noise with parameter values set as shown in Table 2.
The differential transition from surviving to new transcript pool for high and low noise genes is demonstrated in Additional file 2, which shows the cellspecific T>C rate distributions for data simulated with different pulse durations. This illustrates the previously discussed effect of bursting on cellcell turnover variation more clearly, visualising the bimodal vs unimodal transitions occurring under high vs low noise conditions with a video.
Biological findings
Correlating the parameter estimates against each other for our 584 genes also reveals that genes with extremely high expression levels, the majority of which are mitochondrial genes, are able to achieve these high levels primarily by having very large bursts, rather than very frequent bursts or very stable transcripts, although the decay rates do appear somewhat constrained (Fig. 5b). There may be biological upper limits on \(\kappa\) due to the various factors required to be in place to prime a gene for activity and, therefore, it may be preferable to instead increase burst duration (reduce \(k_{off}\)), and therefore burst size, for very high expression levels [32]. A similar phenomenon has been observed previously, in which MYC overexpression lead to increased expression in target genes through increased burst duration and size, rather than increased burst frequency [46, 47]. Estimates for \(\kappa\) and \(\delta\) are also positively correlated, despite \(\kappa\) and \(\delta\) varying across several orders of magnitude. This correlation may be the result of a selective pressure to limit the variability of a and, for example, prevent transcriptional noise levels from becoming excessively high. Alternatively, high burst frequencies would correspond to RNAP rapidly processing over the gene, allowing less time to pause for the nascent transcript to be folded/spliced appropriately than with lower burst frequencies [2], resulting in reduced transcript stability.
Histone modifications and bursting
We next explored the relationship between HMs and transcriptional bursting dynamics with a metagene analysis carried out using ChIPseq data for eight previously analysed HMs [20] and two further HMs (Methods). In this analysis, we removed mitochondrial genes and genes for which we lacked HM data from our set with high confidence parameter estimates, with 505 genes ultimately being included. Of the eight previously studied HMs we analysed, the profiles generally fall into the two previously described categeories [20], being either predominantly promoterlocalised (H3K4me2, H3K4me3, H3K9ac, H3K27ac, Fig. 6 and Additional file 1: Figs. S68) or gene body (GB)localised (H3K4me1, H3K36me3, H3K79me2, H4K20me1, Additional file 1: Figs. S913). To better understand the association between HM profile and bursting parameters, the genes were split in half, sorted by parameter estimate for each of the five parameters. Metagene comparison reveals positiondependent associations for promoterlocalised HMs, using H3K4me2 as an example (Fig. 6). It appears that HM presence at the promoter and through the GB is associated with increased \(\mu\) and also a, while increased \(\kappa\) is specifically associated with promoter but not GB presence. Conversely, presence through the GB excluding the promoter region appears associated with increased b and reduced \(\delta\).
This analysis builds upon a previous scRNAseq study which correlated bursting parameter estimates with HM localisation by averaging the ChIPseq coverage from 2000 bp upstream of the TSS to the TES for each gene [20]. They were unable to obtain estimates of \(\kappa\) or \(\delta\) due to a lack of published data on transcript turnover rates for the cell type (hESCs). Our results are in agreement with [20] despite having a different cell type, but additional complexities are revealed which are only apparent with our metagene analysis combined with the capacity to estimate \(\kappa\) and \(\delta\) afforded by 4sU scRNAseq. For promoterlocalised HMs, they report positive associations between HM presence and both a and b, whilst we demonstrate that the association with b is specific to the GB. We confirm that the association with a holds throughout both the promoter and GB, but show that this is a result of a promoterspecific positive \(\kappa\) association and a GBspecific negative \(\delta\) association, thereby further demonstrating the advantages of 4sU scRNAseq inference.
In order to statistically test these apparent associations, the average HM coverage values around the promoter and through the GB excluding the promoter were obtained for each HM (Methods), taking the average value from 2000 bp upstream of the TSS to 5% through the GB (2000:5%) and from 5% through the GB to the TES (5%:100%), respectively. Spearman’s rank correlation of the mean value for each promoterlocalised HM against each parameter across our 505 genes confirmed the direction and quantified the strength (Fig. 7a), as well as confirmed the statistical significance of the suspected associations (Fig. 7b).
The association between promoterlocalised HM presence and reduced decay rate is consistent with previous reports of a link between HMs and preRNA processing. The RNAP elongation speed may be modulated by HMs or they may be responsible for the recruitment of splicing factors [48, 49]. This could result in more stable RNA by ensuring correct splicing and/or polyadenylation. GB presence of promoterlocalised HMs could also result in increased burst size by facilitating TSSTES contact through the maintenance of the open chromatin state around the TES. Coupled with the free movement of RNAP through the GB, this may increase the burst size by allowing RNAPs to quickly and repeatedly generate multiple transcripts by promoting polymerase recycling [14]. Another hypothesis is the presence alternative TSSs in the GB, which may transcribe simultaneously upon gene activation, causing increased burst size, with 483 out of the 505 analysed genes exhibiting alternative TSSs according to the gtf. Metagenes for the rest of the aforementioned HMs along with the correlation/statistical analysis of the GBlocalised HMs can be found in Additional file 1, as well as analysis of two additional HMs (H3K18ac and H4K16ac) not analysed in [20] (Additional file 1: Figs. S1416), but which were shown to be strongly linked to active enhancer regions [50].
Discussion
With the inference approach presented here, we demonstrate the capacity to obtain genomewide estimates of the parameters governing transcriptional bursting dynamics and the timescales upon which they occur from a single dataset with no prior knowledge. By sampling from the full joint probability distributions of the parameter values given the data, we are able to quantify confidence in our estimates and take into account the complex interdependencies between the different parameters and 4sU scRNAseq data, revealing the regions of parameter space for which we have the most accurate and precise estimates. We show that the distribution of 4sUinduced T>C conversions across cells is shaped not only by the turnover rate and expression level of the gene but also by the transcriptional noise and that this information can therefore be used to improve estimates of burst rate (a) and burst size (b) beyond the level obtainable with conventional scRNAseq. In this way, combining metabolic labelling and single cell resolution has an effect greater than the sum of their parts on inference power. Previous analysis of transcriptional bursting using 4sU scRNAseq data has tapped into this idea by estimating the proportion of new transcripts (based on T>C conversions) in each cell for a particular gene and then using the standard deviation of this new to total ratio as a proxy for burstiness [43]. However, as clearly demonstrated by the video in Additional file 2, this distribution, and therefore its standard deviation, is shaped not only by transcriptional noise but also by RNA turnover and may be skewed by technical noise such as variation in capture efficiency. Therefore, along with the overall expression level, this needs to be explicitly accounted for in order to accurately quantify burstiness, as is naturally achieved with our mathematical model.
Having genomewide estimates of the parameters governing transcriptional dynamics means that it is possible to use the variation which naturally exists between genes to examine the relationships between the different parameters and other features, such as HMs, instead of having to rely on experiments which artificially perturb the cells to gain insight via a single gene system. In agreement with previous reports [42], we find that the genes with very high expression levels are primarily mitochondrial genes. Going beyond this, we show that such activity levels are achieved by having large burst sizes rather than increased RNA stability or burst frequency, which we hypothesise could be due to biological constraints on the rate of switching between active and inactive states [32], potentially making it favourable to instead increase the duration of bursts, and therefore the burst size, as has similarly been observed for MYCdriven transcription [46, 47]. Whereas some studies have found the variation in decay rates (in mESCs) across genes to be an order of magnitude lower than for the other parameters, and therefore negligible [32], we found significant variation in K562 cells which was important to account for in order to properly estimate burst frequencies. This is in line with previous predictions that transcript stability plays an important role in modulating gene expression noise [45]. Indeed, our analysis revealed an unexpected positive correlation between burst frequency and decay rate, resulting in the burst rate, and therefore transcriptional noise, being constrained. One may speculate that only noise levels within a certain range are tolerated, with extreme values resulting in too few cells expressing the gene for a given function to be achieved, such as the appropriate proportion of cells in an isogenic population undergoing differentiation [11, 51], manifesting as the observed correlation. A mechanistic, rather than evolutionary, explanation is that high burst frequencies result in rapid flux of RNAP through the gene, such that less time is allowed for pausing, during which appropriate folding and/or splicing of the nascent transcript is facilitated [2]. This would would reduce transcript stability and cause the observed correlation.
Examining the relationship between bursting parameters and HMs genomewide produced results consistent with but advancing upon previous work [20]. Combining our metagene analysis with the additional information provided by 4sU scRNAseq over inference on conventional scRNAseq reveals intricacies that were not previously apparent. The presence of GBlocalised HMs throughout the gene is generally associated with increased burst rate (bursts per transcript lifetime) via increased burst frequency (bursts per minute), while promoterlocalised HMs are only associated with increased burst frequency when found around the TSS. Their presence further downstream remains associated with increased burst rate, and therefore reduced transcriptional noise, but through reduced decay rate rather than increased burst frequency. The association with reduced decay rate may be related to the previously documented influence of HMs on preRNA processing, which is achieved, for example, by modulating RNAP elongation speed and/or by recruiting splicing factors [48, 49]. This may increase RNA stability by reducing the probability of incorrect splicing or polyadenylation. Presence of promoterlocalised HMs throughout the GB but not at the TSS is also associated with increased burst size. Downstream presence could facilitate interactions between the TSS and the TES by maintaining the open chromatin state around the TES. This, along with maintaining the free movement of RNAP through the GB, could promote polymerase recycling and therefore increased burst size by allowing RNAPs to quickly and repeatedly fire off multiple transcripts during an active period [14]. Another possible explanation for the association between promoterlocalised HM presence in the GB with burst size is that there are multiple, alternative TSSs found within genes. These may initiate transcription in a nonindependent manner, leading to increased burst size when there are more/stronger alternative TSSs, as signalled by HM presence.
The inference approach described here is generally applicable to 4sU scRNAseq datasets which have RNA spikeins and UMIs for any organism or cell type. Furthermore, the model could easily be expanded to integrate an arbitrarily large number of repeat experiments by extending the Markov chain according to the product of the likelihood functions of each dataset. Indeed, such a scheme which utilised datasets with different 4sU pulse durations could theoretically characterise the transcriptional dynamics of all genes genomewide. For example, inference carried out using two datasets with long and short pulse durations would facilitate estimates for genes with long and short transcript lifetimes, respectively, along with everything in between. A caveat of our analysis is the asynchronisation of the cell cycle phase across the population. This may confound the results in two ways, firstly because different phases have a different cellular environment, influencing the global transcriptional dynamics and causing variation in the underlying parameter values for the same gene between cells in different phases. Secondly, there is variation in the copy number of genes throughout the cell cycle, with an unknown proportion of cells having one or two copies of each nuclear gene. Confounding effects on the inference could be resolved by separation of the different subpopulations of cells by cell cycle phase using, for example, fluorescenceactivated cell sorting prior to sequencing [33], and/or by using allelespecific/sensitive scRNAseq approaches combined with metabolic labelling [17, 34]. As 4sU scRNAseq data becomes more common place and there are improvements in capture efficiencies, sequencing depths and cell numbers, it will be possible to robustly infer timeresolved transcriptional bursting dynamics for a far greater number of genes from a single experimental set up. Our findings on burst dynamics and their associations with HMs could be a valuable starting point to inform future experimental work investigating this area, while further application of our method beyond what is presented here might hint at other, novel mechanistic relations.
Conclusions
In conclusion, we have developed a mathematical model to maximally exploit the power of 4sU scRNAseq datasets to examine transcriptional bursting, tapping into the synergy between singlecell resolution and 4sU labelling which manifests in the cellspecific T>C rate distributions. The advantages over conventional scRNAseq were demonstrated in detail using smallscale simulations and performance of the algorithm across parameter space was validated with largescale simulations. We applied our inference approach to published 4sU scRNAseq data to obtain genomewide joint parameter estimates and confidence quantifications, finding an unexpected correlation between burst frequency and decay rate, and that genes with extremely high expression levels achieve this primarily through increased burst size. Finally, we linked our estimates with published ChIPseq data, revealing positiondependent associations between different histone modifications and parameter estimates which only become apparent with 4sU scRNAseq as opposed to conventional scRNAseq.
Methods
Data processing and analysis
4sU scRNAseq
The main datasets that were used for parameter inference in this study were produced in Qiu et al 2020 [38], downloaded from the GEO series GSE141851. Two datasets from this series were used, both using K562 cells; a negative control dataset with TFEA chemical conversion treatment but with no 4sU added, and another dataset which had 4sU added 4 h before chemical treatment, with GEO sample IDs GSM4512696 and GSM4512697, respectively. These are Dropseq datasets and thus were processed according to the “Dropseq alignment cookbook” (https://mccarrolllab.org/wpcontent/uploads/2016/03/DropseqAlignmentCookbookv1.2Jan2016.pdf). A custom Python script was used to carry out trimming of read pairs with any base with phred quality \(\le 10\), and to clip adaptor and polyA tail sequences. The trimmed reads were then aligned to the primary human genome assembly (GRCh38.p13), the fasta file for which was obtained from gencode (https://www.gencodegenes.org/human/), using bwa to build the genome index and for the actual alignment [52]. Custom Python scripts were then used to map the aligned reads with mapq score \(\ge 10\) to their genes according to the gencode.v36 primary human genome assembly annotation gtf file, before extracting cellspecific (using the cell ID part of the read 1 barcode) UMI counts and total read counts for each gene, along with genespecific, cellspecific information for each read about the number of genomic T bases (found in the fasta sequence across the aligned read positions) and the number of those which were converted to C bases in the read sequence. Cell selection was then carried out to exclude those cell IDs corresponding to empty droplets by ordering the cell IDs based on the total number of corresponding read pairs and then selecting the top 400 or 795 IDs for the control and 4sU dataset, respectively, as specified in [38]. The control dataset was then used to derive the genespecific background T>C conversion rates, \(\lambda _s\), based on the proportion of genomic Ts which were converted to Cs across all reads across all selected cells for the given gene. Figure 8a provides a schematic overview of how the T>C conversion data arises from the 4sU scRNAseq experimental protocol.
ChIPseq
Publicly available ChIPseq datasets for ten active HMs produced with K562 cells were downloaded for our analysis. A H3K4me3 ChIPseq dataset was obtained from the GEO series GSE108323 with sample ID GSM2895356, which had been processed with alignment to the hg19 human genome build [53]. Seven more ChIPseq datasets, which had also been processed with alignment to the hg19 human genome build, were obtained from the GEO series GSE29611 with sample IDs GSM733778, GSM733651, GSM733653, GSM733656, GSM733675, GSM733692 and GSM733714, corresponding to H3K9ac, H3K4me2, H3K79me2, H3K27ac, H4K20me1, H3K4me1 and H3K36me3, respectively [54]. Two additional ChIPseq datasets for H3K18ac (aligned to hg19) and H4K16ac (aligned to hg38) were obtained from the series GSE106964 and GSE158736 with sample IDs GSM2862934 and GSM4809274, respectively [55, 56]. The position and read count information from these datasets was used to obtain the singlebase resolution coverage values for each HM. These values were associated with their corresponding genes using the information from the comprehensive gene annotation hg19 (or hg38 for H4K16ac) gtf downloaded from Gencode. Analysis of the correlations between bursting parameter estimates and HM coverage at different sections of the gene was carried out by taking the average coverage value for all bases across the specified section (e.g. from 2k bp upstream of the TSS to the TES) for each gene, so a single value is obtained per gene per HM. Metagene plots were produced by averaging the coverage values for each position through/around the gene across all specified genes, similarly to the metagene analysis described in [57].
Mathematical modelling
In general, we model bursty transcription as a stochastic process closely related to the standard twostate model, as many previous works have [9, 10, 14]. The twostate model has four possible processes of gene activation, gene repression, transcription and degradation, where transcription may only occur with the gene in an active state while degradation acts continuously. This is represented by the following chemical reaction scheme
in which \(k_{on}\), \(k_{off}\), \(\beta\) and \(\delta\) represent the rate constants for gene activation, gene repression, transcription and RNA degradation, respectively, while \(G_{off}\), \(G_{on}\) and RNA represent the different species of repressed gene, active gene and transcript, respectively. A schematic representation of the system is shown in Fig. 8b. Gene activation is known to involve facilitated diffusion, in which TFs not only diffuse through the cytoplasm but can also slide along the DNA after becoming associated, to find the target/activation site. This makes the process more complicated than the simple on/off switch of our model, which reflects TF association/disassociation events without DNA sliding. However, previous modelling work has shown that the twostate model accurately captures gene activation due to the high speeds at which DNAbinding proteins slide along the DNA observed in biological systems, which ensures that they do not influence the transcriptional bursting dynamics [58].
With this model, we have burst frequency, \(\kappa =\frac{1}{(1/k_{on})+(1/k_{off})}\) and burst size, \(b=\frac{\beta }{k_{off}}\), and we recall the burst rate, \(a=\frac{\kappa }{\delta }\). Aiming to understand bursting and its timescales specifically, we make the assumption that bursts occur instantaneously, arrive according to a Poisson process and burst in a geometric fashion, which is valid when \(\delta<< k_{off}\) since a transcript produced in a given burst is unlikely to have degraded before the burst is over [7, 59], and when \(k_{on}<<k_{off}\), which is supported by the parameter estimates reported in [32]. This model simplifies \(\kappa =\lim \limits _{k_{off}\rightarrow \infty }\frac{1}{(1/k_{on})+(1/k_{off})}=k_{on}\) while b remains finite with \(b=\lim \limits _{\beta ,k_{off}\rightarrow \infty }\frac{\beta }{k_{off}}\) [60].
Model 1
The first model aims to model the observed unique molecular identifier (UMI) counts of a given cell, l, from the estimated capture efficiency (Additional file 1: Fig. S1) of that cell, \(\alpha\), in a similar fashion to the technical noise model outlined in [61]. The capture efficiency, \(\alpha\), represents the transcript detection rate for that cell (probability of at least one read corresponding to a particular transcript). Based on the instantaneous bursting version of the twostate model described above, the steady state distribution of the transcript count, m, can be derived directly from the master equation and corresponds to the negative binomial distribution [10, 59, 60, 62]
which is illustrated by the schematic in Fig. 9, where
The full derivation is available in [60]. We may then model the probability distribution of observing l UMIs given m transcripts in the cell with a capture efficiency of \(\alpha\), as a poisson approximation of the true binomial process
where
which is valid when \(\alpha\) is small. We model the observed data, linked by the unobserved steady state transcript distribution by compounding Eqs. 1 and 2 across the state space of m and marginalise
where M is an upper bound corresponding to the 0.9999 quantile of Eq. 1, which avoids summing to \(\infty\), achieving a finite state projection (FSP) [63, 64] with an error of 0.0001. The resulting distribution and its dependence on P(m) is shown in Fig. 9. The approximation in Eq. 2 permits nonzero probability values when \(m<l\), which allows our MCMC algorithm to more efficiently escape regions of parameter space for which \(M<l\). This leads us to the likelihood function of model 1 by taking the product of Eq. 3 across all cells in the data
where \(l_c\) and \(\alpha _c\) represent the observed UMI count (for the given gene) and capture efficiency for cell c, respectively, and \(L=(l_1,\ldots ,l_k)\), with k cells in total in the data and \(\theta =(\mu ,a,\gamma )\). Since we wish to infer the values of \(\theta\) for each gene from the data using this model, we aim to obtain the posterior
which we achieve through MCMC sampling.
Model 2
We will now construct a model which unifies the UMI and T>C conversion aspects of the data with the aim of understanding both bursting dynamics and the timescale upon which they occur. Figure 8a illustrates how the T>C conversion data arises from the experimental protocol. First of all, we define \(\tau =t\delta\) where t is the time before sequencing at which the 4sU nucleotides were added to the cells, otherwise known as the pulse duration. \(\tau\) therefore represents unitless time in terms of transcript lifetimes. Next, we must obtain the probability mass function of the number of transcripts surviving to the sequencing point which were produced before the 4sU was added, otherwise known as the surviving transcripts, s. This distribution, P(s), may be understood as the timedecay of the steady state distribution, P(m), where we have \(\lim \limits _{t\rightarrow \infty }P(s=0)=1\) and \(P(s\vert t=0)=P(m)\) when \(\delta >0\). Degradation acts upon each individual transcript molecule with rate \(\delta\), and therefore the probability of a given transcript produced before 4sU was added surviving is \(1F_{Exp}(X \le t \vert \delta )=f_{Pois}(0\vert \tau )\). Therefore, the probability of having s transcripts surviving given m originally is
where
and
giving the conditional distribution of s. Compounding this with the steady state distribution (Eq. 1) we obtain the marginal
We compute this distribution efficiently by using the approximation
Next, we obtain the probability mass function of the newly synthesised transcript count, P(n), for those transcripts that were produced after the 4sU was added and therefore have a higher T>C conversion rate than the background. This may be understood in reverse to P(s), as it describes the convergence of the newly synthesised transcript count from a point mass at zero to the steady state distribution where we have \(P(n=0 \vert t=0)=1\) and \(\lim \limits _{t\rightarrow \infty }P(n)=P(m)\) when \(a, b, \delta > 0\). An approximate solution to such a distribution was derived as a model of translation in [59] though the assumed relationships apply here. The solution is
which is valid when \(k_{off}>>\delta\) and \(\tau>>\delta /k_{off}\), where \({}_2F_1\) refers to the hypergeometric function. The general dependency of the surviving and new transcript distributions on P(m), as dictated by \(\delta\), is illustrated by Fig. 9. Next, we obtain the probability distribution of transcripts at steady state conditional on our observed cellspecific capture efficiency, \(\alpha\), and UMI count, l, by using Eqs. 1 and 2
Now, we describe the probability distribution of n conditional on m as the joint distribution of n and s
with the convolution \(\sum _{n=0}^m P(n)P(s=mn)\approx P(m)\) being used as a normalising value in place of P(m) due to the approximate nature of P(n), ensuring that \(\sum _{n=0}^mP(n\vert m)=1\). It is now possible to model the number of T>C conversions observed in a given read conditional on m, where we have expanded and built upon the poisson mixture model of conversions described in [36] and compounding with Eq. 11
where P(u) is the genespecific empirical probability mass function of observing u uracils across the fasta sequence corresponding to a given read’s mapping position. \(\lambda _s\) is the genespecific background conversion rate observed in the control dataset (without the addition of 4sU) which represents conversion due to random mutations or other sources outside of chemical conversion. \(\lambda _n\) is the geneinvariant conversion rate due to 4sU incorporation and conversion which was estimated from the data (Additional file 1: Fig. S2). P(u) and the conditional T>C count distribution are shown in Fig. 9, along with the dependence of \(P(i\vert m)\) on P(u), P(s) and P(n) as dictated by \(\lambda _n\) and \(\lambda _s\). We may now model the cellspecific T>C conversion rate for the given gene by compounding Eqs. 10 and 12
where M is an upper bound corresponding to the 0.9999 quantile of Eq. 1, again giving a FSP with error 0.0001. We are finally in a position to complete the model and link all our observables together. The observed counts of conversions in each cell may be represented by y, where \(y_i\) is the number of reads that have i conversions. Therefore, the cellspecific observed distribution of conversions per read may be understood as a multinomial distribution with a probability vector determined by Eq. 13
enabling us to model the conversion data conditional on the UMI data. A likelihood function may now be obtained with
where \(y_c\) is the conversions per read distribution observed in cell c and \(Y=(y_1,\ldots ,y_k)\) where \(y_{c,i}\) is the number of reads with i conversions in cell c for the given gene. The final likelihood function of model 2 is now defined as the product of Eqs. 4 and 15
As in Eq. 5, MCMC sampling was used to obtain
One thing to note about model 2 is that Eq. 9 is an approximate solution and breaks down in certain regions of parameter space. When a and/or b become too large and/or \(\tau\) becomes too small, the function will oscillate around the true probability distribution function, with these oscillations quickly becoming more extreme to the point that the approximate solution gives negative probability values. The solution can be said to become unstable in these regions of parameter space, and therefore such regions will be referred to as unstable parameter space. If a gene is found to reside within an unstable region of parameter space then an alternative to model 2 must be used.
Model 3
Our final model acts as an alternative to model 2 when a gene resides within an unstable region of parameter space. Unlike model 2, this model ignores the cellspecific T>C information in favour of simply pooling the conversions across all cells. We define the probability distribution of observing i conversions for a given read
This is similar to Eq. 12 but is independent of the total transcript count, m, and is therefore not cell specific. We can apply Eq. 18 to the full set of observed conversions across cells, Y, again using the multinomial distribution to obtain a likelihood function
where \(y_i\) represents the number of reads with i conversions summed across all cells rather than being a cellspecific value as in Eqs. 14 and 15. We define the final likelihood function of model 3 as the product of Eqs. 4 and 19.
As in Eqs. 5 and 17, MCMC sampling was used to obtain
Markov chain Monte Carlo algorithm
MCMC was employed in order to sample from the posterior distributions outlined in Eqs. 5, 17 or 21 using a Metropolisadjusted Langevin algorithm (MALA) within a Gibbs sampler, which simulates a Markov chain using Langevin dynamics [65] and corrects the EulerMaruyama integration error with an acceptreject step as with the MetropolisHastings algorithm [66]. The chain is initialised semirandomly, setting \(\theta ^{(1)}\) in a manner which takes advantage of the information immediately available from the data to start the chain relatively close to the target density. We calculate empirical estimates of the expression level, \(\mu\), and transcript lifetime, \(\gamma\), as
where \(N=795\) is the number of cells in the dataset, and as
where \(\lambda\) is the observed conversion rate for the given gene across all reads, while \(\lambda _s\) and \(\lambda _n\) represent the background conversion rate measured in the control dataset and the estimated 4sUmediated conversion rate, respectively. We then set \(\mu =\hat{\mu }\) and draw
and
where
with support [y, z] for \(y>0\) and
We repeatedly draw \(\theta ^{(1)}\) in this way until \(P(X\vert \theta ^{(1)})P(\theta ^{(1)})>0\) where X is the dataset and \(P(\theta )\) represents the prior distribution, which in this case is defined to be an uninformative multivariate uniform distribution such that
where
with support [y, z]. At each step, j, in the Markov chain, the next step is sampled by proposing jumps to new positions in parameter space from the current position, proceeding through three dimensional parameter space with \(\theta =(\mu ,a,\gamma )\). This parameterisation was chosen for Markov chain progression to minimise correlations between parameters and proposals to negative (unsupported) values. The classic MetropolisHastings algorithm [66] corresponds to a random walk through parameter space, which converges relatively slowly to the target density, and which samples from the posterior inefficiently due to slow mixing of the chain, with the optimal acceptance rate (proportion of accepted proposals) being only 0.234 [67]. Therefore, we make use of the MALA as a superior alternative, which converges much more efficiently, requiring only \(O(d^{1/3})\) steps, where d is the dimension of the target density, whereas the random walk requires O(d) steps, while the higher optimal acceptance rate of 0.574 allows for faster mixing and reduced dependence between samples [65]. The Markov chain is treated as an itô diffusion and behaves according to Langevin dynamics with stochastic differential equation
evolving \(\theta\) in imaginary time with a standard Brownian motion diffusion term, W, and a drift term determined by the vector gradient, \(\nabla\), of the logarithm of the posterior density, \(\pi (\theta )\propto P(X\vert \theta )P(\theta )\), with respect to \(\theta\) evaluated at \(\theta _t\). However, we do not have an analytical solution for \(\nabla \log \pi (\theta )\) which means we must estimate this numerically using the change in likelihood observed between the current step, j, and the previous one when generating a proposal. This leads to an additional complication, wherein we may not propose a new sample for all parameters simultaneously since then the observed change in likelihood would be the combined effect of the change in each parameter, making the individual gradients impossible to estimate. Therefore, we must sequentially update each parameter conditional on the current value of all other parameters, which are treated as fixed constants. This corresponds to embedding our MALA within a Gibbs sampler [68, 69], meaning that d substeps are required to move from step j to \(j+1\). At step j, we cycle through each parameter, k, from 1 to d, and draw a new proposal for parameter k from a proposal distribution as determined by Eq. 22
where \(\xi\) is a standard normal random variable and S is an adaptive scaling constant such that the proposal is drawn from
This is accepted with a probability given by the likelihood ratio at the proposed and current value
where substituting \(\pi (\theta )\) for \(P(X\vert \theta )P(\theta )\) gives an equivalent ratio due to the proportionality, which allows us to refer directly to the target density, \(\pi\). Note that the intractable integrals in the denominators of Eqs. 5, 17 and 21 cancel out to allow the acceptance probability to be calculated with only the likelihood function and the prior density. In our special case with uniform priors, these also cancel, only serving to reject proposals outside of the plausible ranges of parameter space as defined by the prior. With probability A we set \(\theta _k^{(j+1)}=\theta _k^{(*)}\), otherwise \(\theta _k^{(j+1)}=\theta _k^{(j)}\) and since we treat parameters other than \(\theta _k\) as constants, we iteratively draw \(\theta\) from the conditional rather than joint densities as
If the proposal is accepted, we update our estimate of the local gradient for the parameter k as
otherwise we set \(\nabla _k=0\). We also recursively update the adaptive scaling constant associated with parameter k in the manner described for the Adaptive Scaling Metropolis algorithm of [67]
with a recursively updated decay term
which in the longterm results in the MALA mixing close to the optimal parameterspecific acceptance rate of 0.574 [65]. At step 1, we initialise \(\nabla =0\), \(S=\theta ^{(1)}/100\) and \(\eta =0.1\). Density plots indicate that in the vast majority of cases acceptance rates close to 0.574 were achieved (Additional file 1: Fig. S5).
The process repeats until 5000 steps have been completed (\(j=5000\)) if \(\hat{\mu }<1000\) or 1500 if \(\hat{\mu }\ge 1000\), since for these genes with very high expression levels each step takes longer but the stronger evidence means that fewer steps are required. Therefore, the Markov chain converges to the posterior distribution according to its gradient. Posteriors were produced from the sampled chain using the last 1000 or 2500 steps for high expression or other genes, respectively, with a thinning factor of 2, where only every 2nd point in the chain is used in order to reduce dependency between points, resulting in smoother posterior densities and sample sizes of 500 or 1250. When using model 2, for each step, we check if the proposal for any substep was rejected because of negative probability values appearing in Eq. 9 due to the approximate nonequilibrium solution failing for an unstable point in parameter space. We set a rolling window size, w, equal to 100 or 500 for high expression or other genes, respectively. We then check at each step, j, if the number of steps with a rejection of this nature is \(\ge w/20\) for steps \([\max ((w/2)+1,jw+1),j]\) and if this condition is met then the Markov chain is restarted using model 3 instead of model 2.
Simulations for model comparison
The performance of inference using different likelihood functions was tested on simulated data. Gillespie’s exact algorithm (stochastic simulation algorithm) [70] was used to simulate data according to the reactant matrix shown in Table 3 and the product matrix shown in Table 4, with the stoichiometry matrix shown in Table 5.
This allows simulation of the pool of transcripts in the cell that was synthesised before (\(RNA_0\)) and after (\(RNA_1\)) the 4sU pulse started. The simulation is run with initial conditions \(X_0=(0,0,0,1)\) where \(X=(RNA_0,RNA_1,G_{on},G_{off})\) and rate constant values are set for bursty expression \(\theta =(\beta _0=50,\beta _1=0,\delta _0=0.001,\delta _1=0.001,k_{on}=0.0005,k_{off}=1)\), running until \(t_0=200000\) to bring the system to steady state. The system state at the end of this run, \(X_{t_0}\), is then used as the initial condition for a second run, where we now set \(\theta =(\beta _0=0,\beta _1=50,\delta _0=0.001,\delta _1=0.001,k_{on}=0.0005,k_{off}=1)\) to simulate the newly synthesised transcripts produced during the 4sU pulse along with decay of preexisting transcripts. A pulse duration of \(t_1=1000\) min was used here, giving the final state of the system \(X_{t_1}\), and importantly giving the counts for \(RNA_0\) and \(RNA_1\) in the cell. This was repeated to simulate \(N=10000\) cells. Insilico sequencing data was then generated based on these simulated transcript count values. Cellspecific capture efficiencies were drawn
before drawing the cellspecific UMI counts, l, corresponding to the two pools of transcripts as
for \(k=0\) and \(k=1\), so that the total UMI count for the given cell is \(l=l_0+l_1\). The cellspecific total number of reads corresponding to each UMI in the two pools is then drawn
where \(\nu =5\) represents sequencing depth and reads per UMI is a zerotruncated Poisson random variable with
using the same logic of Poisson assignment of reads to UMIs as in [71]. Then, the cellspecific total number of reads of the given pool is
The number of uracils across the sequenced part of the transcript is then drawn for each read
where \(\hat{u}=60\) is the average number of uracils per read. The number of conversions in each read in the cell is then drawn for the two pools of transcripts as
and
where we set \(\lambda _s=0.01\) and \(\lambda _n=0.075\). The overall conversion data across all reads in the cell is then \(i=(i_0,i_1)\), where \(i_0=(i_{0,1},\ldots ,i_{0,r_0})\) and \(i_1=(i_{1,1},\ldots ,i_{1,r_1})\), from which we obtain y, where \(y_i\) is the number of reads with i conversions in the given cell. Now we have our simulated dataset which we can use to demonstrate our capacity to recover known parameter values. MCMC was carried out with different likelihood functions in the previously described manner to sample posterior distributions.
Availability of data and materials
The inference algorithm has been made available as a GitHubdistributed R package (https://github.com/hebenstreitLab/burstMCMC) [72]. The preprocessing scripts are similarly available in a GitHub repository (https://github.com/hebenstreitLab/burstMCMCpreprocessing) [73]. Both repositories are licensed under the MIT license. The scripts along with the processed data used in the analysis are available in Zenodo (https://doi.org/10.5281/zenodo.7707970) under the MIT license [74]. The raw 4sU scRNAseq data used can be found under the GEO sample IDs GSM4512696 and GSM4512697 [38, 75]. The processed scRNAseq dataset used for calculating capture efficiencies can be found under the GEO sample ID GSM1599501 [76, 77]. The processed ChIPseq datasets used for metagene analysis can be found under GEO sample IDs GSM2895356, GSM733651, GSM733653, GSM733656, GSM733675, GSM733692 GSM733714, GSM733778, GSM2862934 and GSM4809274 [53,54,55,56, 78,79,80,81].
References
Wissink EM, Vihervaara A, Tippens ND, Lis JT. Nascent RNA analyses: tracking transcription and its regulation. Nat Rev Genet. 2019;20(12):705–23.
Chen FX, Smith ER, Shilatifard A. Born to run: control of transcription elongation by RNA polymerase II. Nat Rev Mol Cell Biol. 2018;19(7):464–78.
Kuehner JN, Pearson EL, Moore C. Unravelling the means to an end: RNA polymerase II transcription termination. Nat Rev Mol Cell Biol. 2011;12(5):283–94.
Proudfoot NJ. Transcriptional termination in mammals: stopping the RNA polymerase II juggernaut. Science. 2016;352(6291):aad9926.
Paulsson J. Models of stochastic gene expression. Phys Life Rev. 2005;2(2):157–75.
Swain PS, Elowitz MB, Siggia ED. Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc Natl Acad Sci. 2002;99(20):12795–800.
Lin YT, Galla T. Bursting noise in gene expression dynamics: linking microscopic and mesoscopic models. J R Soc Interface. 2016;13(114):20150772.
Fukaya T, Lim B, Levine M. Enhancer control of transcriptional bursting. Cell. 2016;166(2):358–68.
Dar RD, Razooky BS, Singh A, Trimeloni TV, McCollum JM, Cox CD, et al. Transcriptional burst frequency and burst size are equally modulated across the human genome. Proc Natl Acad Sci. 2012;109(43):17454–9.
Kim JK, Marioni JC. Inferring the kinetics of stochastic gene expression from singlecell RNAsequencing data. Genome Biol. 2013;14(1):R7.
Losick R, Desplan C. Stochasticity and cell fate. Science. 2008;320(5872):65–8.
Dieci G, Bosio MC, Fermi B, Ferrari R. Transcription reinitiation by RNA polymerase III. Biochim Biophys Acta (BBA)Gene Regul Mech. 2013;1829(34):331–41.
Shandilya J, Roberts SG. The transcription cycle in eukaryotes: from productive initiation to RNA polymerase II recycling. Biochim Biophys Acta (BBA)Gene Regul Mech. 2012;1819(5):391–400.
Cavallaro M, Walsh MD, Jones M, Teahan J, Tiberi S, Finkenstädt B, et al. 3’5’ crosstalk contributes to transcriptional bursting. Genome Biol. 2021;22(1):1–20.
Dey SS, Foley JE, Limsirichai P, Schaffer DV, Arkin AP. Orthogonal control of expression mean and variance by epigenetic features at different genomic loci. Mol Syst Biol. 2015;11(5):806.
Andersson R, Sandelin A. Determinants of enhancer and promoter activities of regulatory elements. Nat Rev Genet. 2020;21(2):71–87.
Sun M, Zhang J. Allelespecific singlecell RNA sequencing reveals different architectures of intrinsic and extrinsic gene expression noises. Nucleic Acids Res. 2020;48(2):533–47.
Benayoun BA, Pollina EA, Ucar D, Mahmoudi S, Karra K, Wong ED, et al. H3K4me3 breadth is linked to cell identity and transcriptional consistency. Cell. 2014;158(3):673–88.
Nicolas D, Zoller B, Suter DM, Naef F. Modulation of transcriptional burst frequency by histone acetylation. Proc Natl Acad Sci. 2018;115(27):7153–8.
Wu S, Li K, Li Y, Zhao T, Li T, Yang YF, et al. Independent regulation of gene expression level and noise by histone modifications. PLoS Comput Biol. 2017;13(6):e1005585.
Chong S, Chen C, Ge H, Xie XS. Mechanism of transcriptional bursting in bacteria. Cell. 2014;158(2):314–26.
Ma L, Gao Z, Wu J, Zhong B, Xie Y, Huang W, et al. Cocondensation between transcription factor and coactivator p300 modulates transcriptional bursting kinetics. Mol Cell. 2021;81(8):1682–97.
Engl C, Jovanovic G, Brackston RD, KottaLoizou I, Buck M. The route to transcription initiation determines the mode of transcriptional bursting in E. coli. Nat Commun. 2020;11(1):1–11.
Dobrinić P, Szczurek AT, Klose RJ. PRC1 drives Polycombmediated gene repression by controlling transcription initiation and burst frequency. Nat Struct Mol Biol. 2021;28(10):1–14.
Popp AP, Hettich J, Gebhardt JCM. Altering transcription factor binding reveals comprehensive transcriptional kinetics of a basic gene. Nucleic Acids Res. 2021;49(11):6249–66.
Maynard KR, Tippani M, Takahashi Y, Phan BN, Hyde TM, Jaffe AE, et al. dotdotdot: an automated approach to quantify multiplex single molecule fluorescent in situ hybridization (smFISH) images in complex tissues. Nucleic Acids Res. 2020;48(11):e66.
Li G, Neuert G. Multiplex RNA single molecule FISH of inducible mRNAs in single yeast cells. Sci Data. 2019;6(1):1–9.
Moffitt JR, Hao J, Wang G, Chen KH, Babcock HP, Zhuang X. Highthroughput singlecell geneexpression profiling with multiplexed errorrobust fluorescence in situ hybridization. Proc Natl Acad Sci. 2016;113(39):11046–51.
Shah S, Takei Y, Zhou W, Lubeck E, Yun J, Eng CHL, et al. Dynamics and spatial genomics of the nascent transcriptome by intron seqFISH. Cell. 2018;174(2):363–76.
Gorin G, Wang M, Golding I, Xu H. Stochastic simulation and statistical inference platform for visualization and estimation of transcriptional kinetics. PLoS ONE. 2020;15(3):e0230736.
Tunnacliffe E, Corrigan AM, Chubb JR. Promotermediated diversification of transcriptional bursting dynamics following gene duplication. Proc Natl Acad Sci. 2018;115(33):8364–9.
Larsson AJ, Johnsson P, HagemannJensen M, Hartmanis L, Faridani OR, Reinius B, et al. Genomic encoding of transcriptional burst kinetics. Nature. 2019;565(7738):251–4.
Ochiai H, Hayashi T, Umeda M, Yoshimura M, Harada A, Shimizu Y, et al. Genomewide kinetic properties of transcriptional bursting in mouse embryonic stem cells. Sci Adv. 2020;6(25):eaaz6699.
Johnsson P, Ziegenhain C, Hartmanis L, Hendriks GJ, HagemannJensen M, Reinius B, et al. Transcriptional kinetics and molecular functions of long noncoding RNAs. Nat Genet. 2022;54(3):1–12.
Herzog VA, Reichholf B, Neumann T, Rescheneder P, Bhat P, Burkard TR, et al. Thiollinked alkylation of RNA to assess expression dynamics. Nat Methods. 2017;14(12):1198.
Schofield JA, Duffy EE, Kiefer L, Sullivan MC, Simon MD. TimeLapseseq: adding a temporal dimension to RNA sequencing through nucleoside recoding. Nat Methods. 2018;15(3):221.
Jürges C, Dölken L, Erhard F. Dissecting newly transcribed and old RNA using GRANDSLAM. Bioinformatics. 2018;34(13):i218–26.
Qiu Q, Hu P, Qiu X, Govek KW, Cámara PG, Wu H. Massively parallel and timeresolved RNA sequencing in single cells with scNTseq. Nat Methods. 2020;17(10):991–1001.
Cao J, Zhou W, Steemers F, Trapnell C, Shendure J. Scifate characterizes the dynamics of gene expression in single cells. Nat Biotechnol. 2020;38(8):1–9.
Battich N, Beumer J, de Barbanson B, Krenning L, Baron CS, Tanenbaum ME, et al. Sequencing metabolically labeled transcripts in single cells reveals mRNA turnover strategies. Science. 2020;367(6482):1151–6.
Hendriks GJ, Jung LA, Larsson AJ, Lidschreiber M, Forsman OA, Lidschreiber K, et al. NASCseq monitors RNA synthesis in single cells. Nat Commun. 2019;10(1):1–9.
Qiu X, Zhang Y, MartinRufino JD, Weng C, Hosseinzadeh S, Yang D, et al. Mapping transcriptomic vector fields of single cells. Cell. 2022;185(4):690–711.
Erhard F, Baptista MA, Krammer T, Hennig T, Lange M, Arampatzi P, et al. scSLAMseq reveals core features of transcription dynamics in single cells. Nature. 2019;571(7765):419–23.
Boileau E, Altmüller J, Naarmannde Vries IS, Dieterich C. A comparison of metabolic labeling and statistical methods to infer genomewide dynamics of RNA turnover. Brief Bioinform. 2021;22(6):bbab219.
Zabet NR, Chu DF. Computational limits to binary genes. J R Soc Interface. 2010;7(47):945–54.
Patange S, Ball DA, Wan Y, Karpova TS, Girvan M, Levens D, et al. MYC amplifies gene expression through global changes in transcription factor dynamics. Cell Rep. 2022;38(4):110292.
Lu D, Jambhekar A, Lahav G. Louder for longer: MYC amplifies gene expression by extended transcriptional bursting. Cell Rep. 2022;38(9):110470.
JimenoGonzález S, Reyes JC. Chromatin structure and premRNA processing work together. Transcription. 2016;7(3):63–8.
Rahhal R, Seto E. Emerging roles of histone modifications and HDACs in RNA splicing. Nucleic Acids Res. 2019;47(10):4911–26.
Wolfe JC, Mikheeva LA, Hagras H, Zabet NR. An explainable artificial intelligence approach for decoding the enhancer histone modifications code and identification of novel enhancers in Drosophila. Genome Biol. 2021;22(1):1–23.
Arias AM, Brickman JM. Gene expression heterogeneities in embryonic stem cell populations: origin and function. Curr Opin Cell Biol. 2011;23(6):650–6.
Li H, Durbin R. Fast and accurate short read alignment with BurrowsWheeler transform. Bioinformatics. 2009;25(14):1754–60.
Mchaourab ZF, Perreault AA, Venters BJ. ChIPseq and ChIPexo profiling of Pol II, H2A. Z, and H3K4me3 in human K562 cells. Sci Data. 2018;5(1):1–8.
Consortium EP, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57.
Vazquez BN, Thackray JK, Simonet NG, Chahar S, KaneGoldsmith N, Newkirk SJ, et al. SIRT7 mediates L1 elements transcriptional repression and their association with the nuclear lamina. Nucleic Acids Res. 2019;47(15):7870–85.
Radzisheuskaya A, Shliaha PV, Grinev VV, Shlyueva D, Damhofer H, Koche R, et al. Complexdependent histone acetyltransferase activity of KAT8 determines its role in transcription and cellular homeostasis. Mol Cell. 2021;81(8):1749–65.
De S, Edwards DM, Dwivedi V, Wang J, Varsally W, Dixon HL, et al. Genomewide chromosomal association of Upf1 is linked to Pol II transcription in Schizosaccharomyces pombe. Nucleic Acids Res. 2022;50(1):350–67.
Schoech AP, Zabet NR. Facilitated diffusion buffers noise in gene expression. Phys Rev E. 2014;90(3):032701.
Shahrezaei V, Swain PS. Analytical distributions for stochastic gene expression. Proc Natl Acad Sci. 2008;105(45):17256–61.
Morrison M, RazoMejia M, Phillips R. Reconciling kinetic and thermodynamic models of bacterial transcription. PLoS Comput Biol. 2021;17(1):e1008572.
Wang J, Huang M, Torre E, Dueck H, Shaffer S, Murray J, et al. Gene expression distribution deconvolution in singlecell RNA sequencing. Proc Natl Acad Sci. 2018;115(28):E6437–46.
Raj A, Peskin CS, Tranchina D, Vargas DY, Tyagi S. Stochastic mRNA synthesis in mammalian cells. PLoS Biol. 2006;4(10):e309.
Munsky B, Khammash M. The finite state projection algorithm for the solution of the chemical master equation. J Chem Phys. 2006;124(4):044104.
Gupta A, Mikelson J, Khammash M. A finite state projection algorithm for the stationary solution of the chemical master equation. J Chem Phys. 2017;147(15):154101.
Roberts GO, Rosenthal JS. Optimal scaling of discrete approximations to Langevin diffusions. J R Stat Soc B Stat Methodol. 1998;60(1):255–68.
Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57(1):97109.
Vihola M. On the stability and ergodicity of adaptive scaling Metropolis algorithms. Stoch Process Appl. 2011;121(12):2839–60.
Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell. 1984;6:721–41.
Gilks WR, Best NG, Tan KK. Adaptive rejection Metropolis sampling within Gibbs sampling. J R Stat Soc Ser C Appl Stat. 1995;44(4):455–72.
Gillespie DT. Exact stochastic simulation of coupled chemical reactions. J Phys Chem. 1977;81(25):2340–61.
Fu GK, Xu W, Wilhelmy J, Mindrinos MN, Davis RW, Xiao W, et al. Molecular indexing enables quantitative targeted RNA sequencing and reveals poor efficiencies in standard library preparations. Proc Natl Acad Sci. 2014;111(5):1891–6.
Edwards DM, Davies P, Hebenstreit D. burstMCMC. GitHub. 2023. https://github.com/hebenstreitLab/burstMCMC. Accessed 30 Mar 2023.
Edwards DM, Davies P, Hebenstreit D. burstMCMCpreprocessing. GitHub. 2023. https://github.com/hebenstreitLab/burstMCMCpreprocessing. Accessed 30 Mar 2023.
Edwards DM, Davies P, Hebenstreit D. Synergising singlecell resolution and 4sU labelling boosts inference of transcriptional bursting. Zenodo. 2023. https://doi.org/10.5281/zenodo.7707970. Accessed 30 Mar 2023.
Qiu Q, Hu P, Qiu X, Govek KW, Cámara PG, Wu H. Massively parallel and timeresolved RNA sequencing in single cells with scNTseq. Gene Expression Omnibus. 2020. https://identifiers.org/geo:GSE141851. Accessed 13 Jan 2021.
Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet barcoding for singlecell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–201.
Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet barcoding for singlecell transcriptomics applied to embryonic stem cells. Gene Expression Omnibus. 2015. https://identifiers.org/geo:GSE65525. Accessed 25 Oct 2020.
Mchaourab ZF, Perreault AA, Venters BJ. ChIPseq and ChIPexo profiling of Pol II, H2A. Z, and H3K4me3 in human K562 cells. Gene Expression Omnibus. 2018. https://identifiers.org/geo:GSE108323 (2017). Accessed 16 Mar 2022.
Consortium EP, et al. An integrated encyclopedia of DNA elements in the human genome. Gene Expression Omnibus. 2012. https://identifiers.org/geo:GSE29611 (2011). Accessed 14 Mar 2022.
Vazquez BN, Thackray JK, Simonet NG, Chahar S, KaneGoldsmith N, Newkirk SJ, et al. SIRT7 mediates L1 elements transcriptional repression and their association with the nuclear lamina. Gene Expression Omnibus. 2019. https://identifiers.org/geo:GSE106964. Accessed 3 Mar 2023.
Radzisheuskaya A, Shliaha PV, Grinev VV, Shlyueva D, Damhofer H, Koche R, et al. Complexdependent histone acetyltransferase activity of KAT8 determines its role in transcription and cellular homeostasis. Gene Expression Omnibus. 2021. https://identifiers.org/geo:GSE158736. Accessed 3 Mar 2023.
Acknowledgements
We thank Louise Dyson for useful discussions regarding the mathematical theory relevant to the study and Francesca Mantellino for valuable input on the data processing strategy.
Review history
The review history is available as Additional File 3.
Peer review information
Veronique van den Berghe was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Funding
The research was funded by the Biotechnology and Biological Sciences Research Council (BB/M01116X/1) and by the Engineering Physical Sciences Research Council (EP/T002794/1).
Author information
Authors and Affiliations
Contributions
DME, PD and DH conceived the study, while DME and DH designed the work. DME and PD acquired the data which DME analysed and DME, PD and DH interpreted. DME and PD created the software. DME wrote the manuscript with input from PD and DH. All authors approve the submission of the manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1.
Supplementary information: Contains methodological information on estimating capture efficiencies and 4sUmediated TC rates and for simulating data for validating algorithm performance. Also contains results of metagene and correlation analyses for various HMs not shown in the main text, as well as a correlation between our estimated transcript decay rates and previously published cellmatched decay rates.
Additional file 2.
High vs low noise cellspecific T>C rate distribution transition: Video gif showing the differential transition from surviving to new transcript pool for high and low noise genes through the cellspecific T>C rate distributions for data simulated with different pulse durations.
Additional file 3.
Review history.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Edwards, D.M., Davies, P. & Hebenstreit, D. Synergising singlecell resolution and 4sU labelling boosts inference of transcriptional bursting. Genome Biol 24, 138 (2023). https://doi.org/10.1186/s1305902302977y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1305902302977y