 Software
 Open Access
 Published:
BayFish: Bayesian inference of transcription dynamics from population snapshots of singlemolecule RNA FISH in single cells
Genome Biology volume 18, Article number: 164 (2017)
Abstract
Singlemolecule RNA fluorescence in situ hybridization (smFISH) provides unparalleled resolution in the measurement of the abundance and localization of nascent and mature RNA transcripts in fixed, single cells. We developed a computational pipeline (BayFish) to infer the kinetic parameters of gene expression from smFISH data at multiple time points after gene induction. Given an underlying model of gene expression, BayFish uses a Monte Carlo method to estimate the Bayesian posterior probability of the model parameters and quantify the parameter uncertainty given the observed smFISH data. We tested BayFish on synthetic data and smFISH measurements of the neuronal activityinducible gene Npas4 in primary neurons.
Background
Celltocell variation in gene expression across an isogenic population is a fact of life. The initiation of transcription involves a series of stochastic biochemical events, which includes chromatin accessibility, the binding of transcription factors, and the assembly of RNA polymerase at the promoter of a gene [1]. Distinct promoter states can often arise when one of these biochemical events is ratelimiting. The existence of multiple promoter states with different expression rates can generate transcriptional bursting, which are episodes of transcriptional activity followed by long periods of inactivity [2–4]. This phenomenon has been observed in bacteria [5, 6], yeast [7, 8], flies [9, 10], and mammals [11–16].
Celltocell variability in gene expression is often studied using techniques that measure transcription in single cells. One such technique, singlemolecule RNA fluorescence in situ hybridization (smFISH) measures the abundance and localization of individual transcripts in a single cell [17, 18]. This method uses a cocktail of fluorescently labelled DNA oligos complementary to the target RNA and works in many organisms; see [19]. Each individual transcript is bound by fluorescent DNA probes and appears as a bright, diffractionlimited spot in a fluorescence microscope. When there are multiple transcripts (e.g., active transcription sites, TSs, at gene loci), the measured intensity can be significantly brighter. The smFISH technique is simple and has been rapidly adopted by other labs to address celltocell variability in gene expression. This has been helped in part by software packages [20, 21] that facilitate image segmentation and spot analysis.
Gene expression is dynamic and the properties of transcriptional bursting must be inferred from smFISH data, which are static snapshots or distributions of mRNA and active TSs per cell sampled from a population. This inference is done using mathematical models of stochastic gene expression, whose predicted distributions of transcripts and active TSs in a population of cells are then fitted to smFISH data to infer the model parameters and likely properties of transcriptional bursting [22]. The simplest model that generates such bursting is the twostate model, which presumes that a gene stochastically switches between two promoter states, a transcriptionally active state and an inactive state. The advantage of the twostate model is that the distributions have been solved analytically and model parameters are inferred by fitting moments (e.g., mean and variance) or the full distributions to the observed smFISH data using leastsquares approaches. Despite the success of twostate models, more complicated models are often needed to explain the observed distributions properly [7, 15, 16, 23]. These complex models often do not have analytical solutions and one must resort to simplifying assumptions or computationally intensive numerical methods to calculate distributions [24]. This is especially true for genes that are not in a steady state, e.g., induced genes.
We used smFISH to measure transcripts of the neuronal activityinducible gene Npas4 in primary neurons after membrane depolarization with elevated extracellular potassium (Fig. 1). Our Npas4 smFISH measurements showed a surprising amount of celltocell variation in both transcript levels and active TSs even when all neurons were exposed to a uniform external stimulus. Given prior studies of celltocell variability in gene expression in other systems, this variability in the transcriptional response of activityinducible genes is likely to arise from the probabilistic activation of transcriptional bursting at single alleles. We reasoned that we could use this singlecell transcriptional variability to build a model of activityinducible Npas4 induction that would inform our quantitative understanding of the transcriptional processes that drive dynamic changes in Npas4 expression following neuronal activation.
Thus, we developed a computational pipeline (BayFish) that uses a Bayesian approach to infer the best model parameters from smFISH data and to quantify the uncertainty in those parameters rigorously. The user specifies any mathematical model of stochastic gene expression with an unknown set of parameters (θ) and provides smFISH data (Y) at different time points before and after induction. BayFish then uses a Monte Carlo method to estimate the Bayesian posterior probability P(θY) of the model parameters, which elucidates the bestfitting parameters and quantifies their uncertainty given the current smFISH data. We first tested BayFish on synthetic data and demonstrate how to select the best model from multiple mathematical models by combining information criteria with the likelihood and Bayesian posterior calculated by BayFish. We then used BayFish on the Npas4 smFISH data to infer the parameters of an underlying twostate model of gene expression that were likely affected by the stimulus. Our results show that a twostate promoter model can recapitulate Npas4 dynamics after induction and we further inferred that the transition rate from the promoter off state to the on state is increased by the stimulus.
There is currently no software that allows a user to specify any model of stochastic gene expression, evaluate the time evolution of mRNA and active TS distributions after induction, and rigorously infer parameters and confidence intervals from smFISH data using the Bayesian posterior probability. We expect BayFish to fill an important gap that will facilitate the adoption of the smFISH technique by other laboratories that wish to address celltocell variability in gene expression.
Results
BayFish is a software package that combines numerical methods with a Monte Carlo method to estimate the Bayesian posterior probability P(θY) of model parameters (θ) given the observed smFISH data (Y) at different time points before and after induction. Bayes theorem states that P(θY)=P(Yθ)P(θ)/P(Y) where P(Yθ) is the likelihood \(\mathcal {L}\) of the data given the parameters. P(θ) and P(Y) are the prior probability distributions of the parameters and the data, respectively. Each iteration of the Monte Carlo method uses several numerical subroutines to calculate the time evolution of the mRNA and active TS distributions given a set of model parameters (θ), to evaluate the likelihood that the smFISH data (Y) were sampled from this distribution or \({\mathcal {L} = P(Y  \theta)}\), and to calculate the Bayesian posterior probability \(\mathcal {P} = P(\theta  Y)\) given the likelihood and priors. The global program is based on the Metropolis random walk algorithm [25, 26]:

1.
Specify a mathematical model of stochastic gene expression that has an unknown set of parameters θ.

2.
Choose an initial θ and calculate the corresponding likelihood \(\mathcal {L} = P(Y  \theta)\) and Bayesian posterior probability \(\mathcal {P} = \mathcal {L} P(\theta) / P(Y)\) using several numerical subroutines.

3.
Iterate over t={1,2,…,T} as follows:

(a)
Draw a random proposal \(\phi \sim \theta _{t} + \mathcal {N}(0,\Sigma)\), where \(\mathcal {N}(0,\Sigma)\) is a multivariate normal distribution with the same dimension as θ and with zero mean. Σ is the covariance matrix.

(b)
Evaluate the likelihood of the proposal \(\mathcal {L}_{\phi } = P(Y  \phi)\) using several numerical subroutines.

(c)
Calculate the Bayesian posterior probability \(\mathcal {P}_{\phi } = \mathcal {L}_{\phi } P(\phi) / P(Y)\).

(d)
Update parameters θ _{ t+1}←ϕ and \(\mathcal {P}_{t+1} \leftarrow \mathcal {P}_{\phi }\) with probability \(\text {min}(\mathcal {P}_{\phi } / \mathcal {P}_{t},1)\); otherwise, θ _{ t+1}←θ _{ t } and \(\mathcal {P}_{t+1} \leftarrow \mathcal {P}_{t}\).

(a)
Over time, the algorithm will generate a Markov chain of θ _{ t } whose distribution converges to the Bayesian posterior probability P(θY). BayFish saves the likelihood \(\mathcal {L}_{t}\) and θ _{ t } of each step. After discarding the early part of the chain (the burnin phase), the remaining θ _{ t } values were used to estimate the Bayesian posterior probability P(θY); see “Methods.”
Mathematical model of stochastic gene expression
We considered a twostate model of gene expression (Fig. 2), where each promoter can be in an inactive off state ρ _{0} with a basal transcription level (synthesis rate μ _{0}) or an active on state ρ _{1} with a higher transcription level (synthesis rate μ _{1}). Transitions between promoter states occur with a promoter activation rate k _{1} and a promoter deactivation rate k _{0}. We chose a twostate model because it is the simplest model that can generate transcriptional bursting, a feature observed in our Npas4 smFISH data (Fig. 1). Neurons are postmitotic and, thus, our model does not include duplicated alleles (e.g., three or four active loci) that arise after DNA replication. Each promoter allele was assumed to be regulated independently, as shown previously [11, 15, 23, 27]. The twostate model parameter set, which determines the dynamics of mRNA and active promoters, is θ={μ _{0},μ _{1},k _{1},k _{0}}. We fixed the mRNA degradation rate δ because it is a known quantity, but this parameter could be a free parameter in other models.
Our smFISH experiments measured gene expression both before and after the stimulus. We presumed that gene expression before the stimulus was at a steady state determined by one set of model parameters (θ ^{U}, unstimulated parameter set). Upon induction, the stimulus changed one or more of the model parameters (θ ^{S}, stimulated parameter set). Thus, the mRNA and active TS distribution will evolve towards a new steady state in response to the changed parameters. Below, we describe how we calculated the stationary mRNA and active TS distribution before the stimulus using θ ^{U} and how we then calculated the time evolution of the distribution after the stimulus using θ ^{S}.
Time evolution of the probability distribution
The chemical master equation (CME) is an infinite set of coupled differential equations that describe the dynamics of the probability of the biochemical system being in a particular state x at time t, P(x,t) [28, 29]. The probability flow into and out of each state x is given by:
The summation is over all possible biochemical reactions k into and out of state x:
where a _{ k }(x) ∂ t is the probability that the biochemical reaction k will occur within the infinitesimal time interval ∂ t given that the system is in state x. The model parameters θ affect the propensities of different biochemical reactions (Fig. 2), and the stoichiometric vector (ν _{ k }) of reaction k describes how the system state changes when the reaction k occurs. More generally, the CME is written in matrix form:
where all possible cell states X are enumerated as a vector [x _{ 1 },x _{ 2 },…,x _{ N }]^{T}. P(X,t) is the probability density state vector [P(x _{ 1 },t),P(x _{ 2 },t),…,P(x _{ N },t)]^{T} of possible states organized identically to X. The state reaction matrix A(θ) has elements:
Prestimulus stationary distribution
We assumed that the prestimulus mRNA and active TS distribution P ^{∗}(X) is timeindependent and stationary. We calculated the stationary distribution by setting Eq. 3 to zero and determined the nonzero eigenvector V≥0 in the kernel of A(θ ^{U}) using the Arnoldi iteration algorithm [30] (eigs MATLAB function, or eig_gen Armadillo C++ library). Each element of P ^{∗} is given by
where V _{ i } is the ith element in the vector V=[ V _{1},V _{2},…,V _{ N }]^{T} and \({\sum \nolimits }_{i} P^{*}(\boldsymbol {x_{i}})=1\).
Poststimulus distribution dynamics
Given an initial distribution P ^{∗}(X) at time zero and poststimulus state reaction matrix A(θ ^{S}), the poststimulus distribution P(X,τ) at time τ after stimulus is:
We calculated P(X,τ) after induction using the same MATLAB routines from the finite state projection method [24], or the equivalent functions in the Armadillo C++ library. We used finite state projection to verify that our estimated probability distributions were below the error threshold (ε≤10^{−12}) for finite M; see below.
Likelihood of smFISH data from probability distributions
The smFISH data are for a finite sample of cells at several time points {0,τ _{1},τ _{2},…,τ _{ S }} after induction. Each cell was in a state, i.e., number of active TSs and mRNA molecules, contained within [x _{ 1 },x _{ 2 },…,x _{ N }]^{T}. The size (N) of the vector and matrix is determined by N=p(M+1), where p is the number of distinct promoter states per cell (p=3 for a twostate model and two alleles per cell, i.e., a cell can have zero, one, or two active TSs). M is the maximum number of mRNA molecules a cell can display, which could, in principle, be infinite. For practical purposes, we chose M=500 because it is finite and larger than the expected mRNA levels in our smFISH data. The smFISH data vector Y ^{t} for sample t is a count of observed cell states, where [n _{1},n _{2},…,n _{ N }]^{T}. The likelihood of having sampled the observed data given the calculated distributions P(X,τ) for model parameters θ is a product of multinomial distributions:
Calculating the Bayesian posterior probability
The Bayesian posterior probability is the likelihood \(\mathcal {L}\) multiplied by P(θ) and divided by P(Y), which are the prior probability distributions of the parameters and data. These priors are often unknown and P(θ) and P(Y) are presumed flat and constant, i.e., any parameter set and data set are equally likely. BayFish assumes flat priors unless specified otherwise. We implemented a Heaviside step function for P(θ), where the prior was zero for nonphysiological parameters, but otherwise flat and constant. Nonphysiological parameters include negative numbers (i.e., below 10^{−8}) or a maximum transcription rate (i.e., 12–18 mRNAs per minute; see [31]).
Validating BayFish with synthetic smFISH data
To test the ability of BayFish to infer parameters correctly, we generated synthetic smFISH data from a twostate model with known parameters. Our first model was a k _{1}stimulus model, where k _{1} changed from \(k_{1}^{U}\) to \(k_{1}^{S}\) upon induction and all other parameters stayed constant; see “Methods”. We created three technical replicates of synthetic smFISH data with a similar sampling density and number of time points as our Npas4 data. Each technical replicate (Fig. 3 a) is different from the others only because of sampling error. We then ran BayFish using an underlying k _{1}stimulus model to infer the free parameters of each technical replicate. The mRNA degradation rate was not a free parameter in these BayFish runs and was fixed to its known value to mimic our situation for Npas4.
If the synthetic smFISH data were too sparse to constrain the model, then we would expect the Bayesian posterior distributions to be flat. However, each BayFish run converged to welldefined Bayesian posterior distributions of model parameters and the technical replicates had posterior distributions that were relatively close to one another and overlapped the true underlying parameters (Fig. 3 b). This demonstrates that sparsely sampled smFISH data at multiple time points already constrain the parameters of the underlying model. We then created a synthetic smFISH data set using the same k _{1}stimulus model, but varied the sampling density at each time point (n=30,100,300,1000 cells). As expected, increasing the sampling density better constrained the Bayesian posterior distribution and more accurately estimated the underlying model parameters (Fig. 3 c).
Model selection using BayFish and information criteria
Previously, we initialized BayFish with the correct underlying model (k _{1}stimulus model). However, one does not usually know the correct model and it has to be inferred along with the unknown parameters. It is well known that models with more parameters have a higher likelihood of fitting the data. Thus, we combined BayFish with several likelihoodbased metrics to evaluate different underlying models and penalize those with more free parameters (see “Methods”). These metrics are the Bayesian information criterion (BIC) [32] and the Akaike information criterion (AIC) [33], which are based on the maximum likelihood calculated by BayFish. The deviance information criterion (DIC) [34] uses both the likelihood and the Bayesian posterior distribution calculated by BayFish.
To test the ability of BayFish and information criteria to select the correct model, we generated two synthetic smFISH data sets from different parameterstimulus models. The first set was generated using a k _{1}stimulus model, whereas the second set was generated using a more complex (k _{1},k _{0},μ _{1})stimulus model; see “Methods.” We then systematically ran BayFish using multiple underlying parameterstimulus models, where different combinations of parameters were affected by the stimulus: k _{1}, k _{0}, μ _{1}, (k _{1},μ _{1}), (k _{0},μ _{1}), and (k _{1},k _{0},μ _{1})stimulus models. The oneparameterstimulus models had five free parameters and the threeparameterstimulus model had seven free parameters to be inferred. As before, the mRNA degradation rate was fixed to its known value. We ran three replicas of BayFish with random initial parameters for T=10^{5} iterations for each underlying parameterstimulus model. We then plotted the different information metrics obtained from each BayFish run on the k _{1}stimulus synthetic data set (Fig. 4 a) and the (k _{1}, k _{0}, μ _{1})stimulus synthetic data set (Fig. 4 b). A lower information criterion score indicates that the underlying model had a better fit. Our results with synthetic data demonstrate that BayFish and the different information criteria select the correct underlying model.
Running BayFish on Npas4 smFISH data
We then used BayFish to infer parameters and select an underlying parameterstimulus model for the Npas4 smFISH data. We used the same approach as above, but the Npas4 mRNA degradation rate constant was fixed to δ=0.0559 min ^{−1} [35]. Our results demonstrate that the best underlying model with the fewest parameters is the (k _{1},k _{0})stimulus model (Fig. 5). The inferred mRNA and active TS distribution and Bayesian posterior distribution of the (k _{1},k _{0})stimulus model are shown in Fig. 6 and summarized in Table 1. Model selection using BayFish and information criteria also showed that not all parameters are equivalent. Regulation of k _{1} by the stimulus consistently gave a better fit to the observed data than regulation by k _{0} or μ _{1} alone or in combination.
Discussion
Like any inference approach, BayFish is limited by the information content of the data and the underlying model assumptions. For example, our smFISH data measured the mRNA and active TS counts per cell, but one could also measure the brightness of TS sites from the smFISH data to estimate the number of nascent mRNAs and, hence, the transcription rate (μ _{1}) [20]. This additional information could further constrain the underlying model, as has been done by others [15]. We did not include or fit TS intensity in our Npas4 model and, thus, this provides an independent test of the μ _{1} parameter inferred by BayFish. We estimated nascent mRNAs and the transcription rate from Npas4 active TSs (Additional file 1: Figure S2). The estimated transcription rate has a strong mode between 7–10 mRNA min ^{−1}, which is consistent with our inferred μ _{1} of 8 mRNA min ^{−1}; see Table 1.
However, our analysis also shows that there are caveats with estimating the transcription rate from the integrated intensity of active TSs. First, the estimate depends on the choice of transcription elongation rate, which can vary across genes and organisms [31]. Second, some active TSs have more nascent mRNAs and a higher transcription rate than theoretically possible (grey area in Additional file 1: Figure S2). The simplest explanation for these unusually bright spots is that mRNAs continue to be associated with chromatin at active TSs after transcription until further processing [36]. Thus, the integrated intensity of an active TS cannot be assumed to represent only nascent mRNAs in the process of transcription.
Our mathematical model of stochastic gene expression also assumed that each promoter allele was regulated independently [11, 15, 23, 27]. However, previous work has also shown that genes can exhibit strongly correlated gene expression, particularly when integrated adjacent to one another on the same chromosome [14, 37]. If one Npas4 allele is independent of the other, then we expect the active TS to exhibit a binomial distribution of zero, one, or two active TSs with probability (1−p)^{2}, 2p(1−p), or p ^{2} where p=k _{1}/(k _{0}+k _{1}), i.e., the probability of an active allele or burst fraction. Although our results show a statistically significant difference between the measured and expected fractions for independent alleles (Additional file 1: Figure S3), the data are closer to independent alleles than perfectly correlated alleles (i.e., there are no cells with one active TS). Modeling the weak correlations between alleles and the posttranscriptional processing of mRNAs at active TSs is beyond the scope of our current software package, but these could be potentially informative extensions of BayFish.
Conclusions
We developed a suite of MATLAB programs (BayFish), and an alternative C++ version, that use Bayesian inference to estimate model parameters robustly from smFISH data. We expect this software package to be useful for other labs because it fills a critical gap in the downstream analysis of population snapshots of smFISH in single cells. The user specifies any mathematical model of stochastic gene expression with an unknown set of parameters (θ) and provides smFISH data (Y) of mRNA and active TS counts in a population of cells at different time points before and after induction. BayFish uses a Monte Carlo method to estimate the Bayesian posterior probability P(θY) of the model parameters, which elucidates the bestfitting parameters and quantifies their uncertainty. Based on the confidence intervals of inferred parameters from a current data set, BayFish permits labs to design the next set of experiments and collect additional smFISH data (e.g., different times or more cells) that is maximally informative.
We generated synthetic data to validate the ability of BayFish to infer the correct parameters and tested its performance on smFISH data sets with sampling error. We further demonstrated how BayFish can be combined with information criteria to select the most informative underlying model. Finally, we used BayFish to extract meaningful biological information from Npas4 gene expression in single neurons (Fig. 1). Our results favor a twostate model where the stimulus increases k _{1} and decreases k _{0}. Both parameters modulate the Npas4 burst fraction, e.g., fraction of time that a promoter spends in the active, on state, without changing the transcription rate of the on or off state. Modulation of the burst fraction upon induction is consistent with previous observations for other genes [13, 15, 38], although modulation of the transcription rate (μ _{1}) upon induction has also been documented [14]. Future experiments will address mechanisms of activation and celltocell variability in Npas4 and other immediateearly genes of primary neurons. This can be done by combining genetic and pharmacological perturbations of gene expression with downstream BayFish analysis of multicolor smFISH distributions of several immediateearly genes.
Methods
Npas4 smFISH measurements in single neurons
Neuronenriched cultures were generated from the cortex of male and female E16.5 CD1 mouse embryos (Charles River Laboratories Inc., Wilmington, MA, USA) and cultured as previously described [39]. Neurons were treated with 1 µM tetrodotoxin (TTX) (Tocris Cookson, Ballwin, MO, USA), a sodium channel inhibitor, at DIV6 and depolarized by elevating the extracellular potassium concentration to 55 mM with an isotonic KCl solution at DIV7 [40], which activates Ltype voltagegated calcium channel dependent transcription of Npas4 [41]. Cells were fixed at four time points: no KCl, 5 min KCl treatment, 5 min KCl treatment plus 10 min condition medium, and 5 min KCl treatment plus 20 min condition medium as indicated in Fig. 1.
Neurons were fixed in 4% Paraformaldehyde (PFA) at room temperature for 10 min after sampling and permeabilized by 70% (v/v) EtOH at 4 °C overnight. The mouse Npas4 mRNAs were hybridized with the Quasar^{®;} 570 Stellaris RNA FISH Probe set following the manufacturer’s instructions, which are available online. Custom Stellaris^{®;} FISH Probes were designed against mouse Npas4 mRNA by utilizing the Stellaris^{®;} RNA FISH Probe Designer (Biosearch Technologies, Inc., Petaluma, CA, USA), which is available online. We hybridized probes to samples in a hybridization buffer (10% formamide, 10% 20 × SSC, 10% dextran sulfate, 1 mg mL ^{−1} Escherichia coli tRNA, 2 mM vanadyl ribonucleoside complex, and 20 µg mL ^{−1} Bovine serum albumin (BSA)) at 37 °C for 4 hours followed by Hoechst staining. Zstack images were captured on a widefield microscope (DMI4000, Leica) equipped with a CCD camera (DFC365 FX, Leica) and controlled by MetaMorph (Molecular Devices). An objective with NA 1.4 and 63 × magnification yielded an xy pixelsize of 146 nm. Then, 35–45 Zslices were recorded with a 200 nm stepsize and 1 second exposure time.
We used FISHquant [21] to identify and count absolute mRNA numbers and active TSs in single cells (Fig. 1). The active TSs can be detected because nascent mRNAs are transiently attached to the elongating RNA Polymerase II in the gene, accumulating fluorescent probes around active sites, and then appear as highly intense dots (one or two, as there are two copies of the gene) in the nucleus of the diploid cell. We and others have confirmed that these nuclear spots mark the active TSs because they colocalize in twocolor smFISH with probes specific for the gene introns, which are present only in nascent RNAs (data not shown and [42]).
Monte Carlo sampling and burnin
The number of iterations (T), covariance matrix (Σ), and burnin period were determined by monitoring the acceptance rate of proposals and the distribution of parameters and likelihood in the stationary phase of the Monte Carlo algorithm. The rate at which the Markov chain approaches stationarity (i.e., the region with higher likelihood) depends on the covariance matrix Σ used to draw new proposals. We defined the burnin as the initial period where the loglikelihood was increasing and less than 99.5% of the maximum. The burnin period is sensitive to the initial parameters and the parameterstimulus model. Given our experimental data, we verified that T=10^{5} iterations and our covariance matrix Σ were sufficient for BayFish to achieve stationarity and adequately sample the Bayesian posterior distribution after discarding the burnin. The final covariance matrix Σ was diagonal with 10^{−5} for k _{0},k _{1},μ _{0} and 10^{−3} for μ _{1} proposals.
Generating synthetic smFISH data
For a given stimulus model and known parameter set (truth), we calculated the mRNA and active TS distributions using the algorithms described in the main text. The prestimulus stationary distribution at t=0 min was generated using the unstimulated parameters, whereas the poststimulus distributions at t=5,15,25 min were generated using the stimulated parameters. From these distributions, we created a technical replicate by randomly sampling n cells from each mRNA and active TS distribution calculated at each time point. We generated technical replicates of synthetic smFISH data for two parameterstimulus models: a k _{1}stimulus and a (k _{1}, k _{0}, μ _{1})stimulus model. The k _{1}stimulus model had the following six parameters: \(k_{1}^{U} = 0.01\) min ^{−1}, \(k_{1}^{S} = 1\) min ^{−1}, k _{0}=0.1 min ^{−1}, μ _{1}=2 mRNA min ^{−1}, μ _{0}=0.01 mRNA min ^{−1}, and δ=0.05 min ^{−1}. The (k _{1}, k _{0}, μ _{1})stimulus model had the following eight parameters: \(k_{1}^{U} = 0.01\) min ^{−1}, \(k_{1}^{S} = 1\) min ^{−1}, \(k_{0}^{U} = 1\) min ^{−1}, \(k_{0}^{S} = 0.01\) min ^{−1}, \(\mu _{1}^{U} = 0.2\) mRNA min ^{−1}, \(\mu _{1}^{S} = 2\) mRNA min ^{−1}, μ _{0}=0.01 mRNA min ^{−1}, and δ=0.05 min ^{−1}.
Information criterion and model fitting
We used several information criteria, such as BIC [32], AIC [33], and DIC [34], to evaluate the likelihood of different models and to penalize model overfitting:

Bayesian information criterion:
$$ \text{BIC} = 2 \ln(\hat{\mathcal{L}}) + m \ln (n). $$(8) 
Akaike information criterion:
$$ \text{AIC} = 2 \ln (\hat{\mathcal{L}})  2 m + \frac{2m(m+1)}{nm1}. $$(9)
The maximum likelihood \(\hat {\mathcal {L}}= P(Y\hat {\theta })\) is the maximum value of \(\mathcal {L}\) obtained during the BayFish run, m is the number of free parameters that were fit, and n is the total sample size. These metrics do not take full advantage of the Bayesian posterior probability estimated by BayFish. Thus, we also used:

Deviance information criterion:
$$ \text{DIC} = 2 \bar{D}  D(\bar{\theta}). $$(10)
The deviance is
\(\bar {D} = E[D(\theta)]\) is the mean of the deviance D(θ) calculated from the Bayesian posterior probability, whereas \(D(\bar {\theta }) = D(E[\theta ])\) is the deviance of the mean of θ calculated from the Bayesian posterior probability.
Availability and requirements
Project name: BayFish
Project homepage: https://github.com/mgschiavon/BayFish; http://doi.org/10.5281/zenodo.830056
Operating system: Platform independent
Programming language: MATLAB or C++
Other requirements: See README file in the project homepage.
License: GNU General Public License v3.0
Additionally, the datasets analyzed during the current study are also available in the GitHub repository, https://github.com/mgschiavon/BayFish/tree/master/DATA.
Abbreviations
 AIC:

Akaike information criterion
 BIC:

Bayesian information criterion
 CME:

Chemical master equation
 DIC:

Deviance information criterion
 FISH:

Fluorescence in situ hybridization
 RNA:

Ribonucleic acid
 smFISH:

Singlemolecule RNA FISH
 TS:

transcription site
References
Lenstra TL, Rodriguez J, Chen H, Larson DR. Transcription dynamics in living cells. Annu Rev Biophys. 2016; 45(1):25–47.
Kaufmann BB, van Oudenaarden A. Stochastic gene expression: from single molecules to the proteome. Curr Opin Genet Dev. 2007; 17(2):107–12.
Sanchez A, Golding I. Genetic determinants and cellular constraints in noisy gene expression. Science. 2013; 342(6163):1188–93.
Suter DM, Molina N, Naef F, Schibler U. Origins and consequences of transcriptional discontinuity. Curr Opin Cell Biol. 2011; 23(6):657–62.
Golding I, Paulsson J, Zawilski SM, Cox EC. Realtime kinetics of gene activity in individual bacteria. Cell. 2005; 123(6):1025–36.
Taniguchi Y, Choi PJ, Li GW, Chen H, Babu M, Hearn J, et al.Quantifying E. coli proteome and transcriptome with singlemolecule sensitivity in single cells. Science. 2010; 329(5991):533–8.
Neuert G, Munsky B, Tan RZ, Teytelman L, Khammash M, van Oudenaarden A. Systematic identification of signalactivated stochastic gene regulation. Science. 2013; 339(6119):584–7.
Zenklusen D, Larson DR, Singer RH. SingleRNA counting reveals alternative modes of gene expression in yeast. Nat Struct Mol Biol. 2008; 15(12):1263–71.
Bothma JP, Garcia HG, Esposito E, Schlissel G, Gregor T, Levine M. Dynamic regulation of eve stripe 2 expression reveals transcriptional bursts in living Drosophila embryos. Proc Natl Acad Sci. 2014; 111(29):10598–603.
Fukaya T, Lim B, Levine M. Enhancer control of transcriptional bursting. Cell. 2016; 166(2):358–68.
Bahar Halpern K, Tanami S, Landen S, Chapal M, Szlak L, Hutzler A, et al.Bursty gene expression in the intact mammalian liver. Mol Cell. 2015; 58(1):147–56.
Battich N, Stoeger T, Pelkmans L. Control of transcript variability in single mammalian cells. Cell. 2015; 163(7):1596–610.
Dar RD, Razooky BS, Singh A, Trimeloni TV, McCollum JM, Cox CD, et al.Transcriptional burst frequency and burst size are equally modulated across the human genome. Proc Natl Acad Sci. 2012; 109(43):17454–9.
Raj A, Peskin CS, Tranchina D, Vargas DY, Tyagi S. Stochastic mRNA synthesis in mammalian cells. PLoS Biol. 2006; 4(10):e309.
Senecal A, Munsky B, Proux F, Ly N, Braye FE, Zimmer C, et al.Transcription factors modulate cFos transcriptional bursts. Cell Rep. 2014; 8(1):75–83.
Suter DM, Molina N, Gatfield D, Schneider K, Schibler U, Naef F. Mammalian genes are transcribed with widely different bursting kinetics. Science. 2011; 332(6028):472–4.
Femino AM, Fay FS, Fogarty K, Singer RH. Visualization of single RNA transcripts in situ. Science. 1998; 280(5363):585–90.
Levsky JM, Shenoy SM, Pezo RC, Singer RH. Singlecell gene expression profiling. Science. 2002; 297(5582):836–40.
Raj A, van den Bogaard P, Rifkin SA, van Oudenaarden A, Tyagi S. Imaging individual mRNA molecules using multiple singly labeled probes. Nat Methods. 2008; 5(10):877–9.
Bahar Halpern K, Itzkovitz S. Single molecule approaches for quantifying transcription and degradation rates in intact mammalian tissues. Methods. 2016; 98:134–42.
Mueller F, Senecal A, Tantale K, MarieNelly H, Ly N, Collin O, et al.FISHquant: automatic counting of transcripts in 3D FISH images. Nat Methods. 2013; 10(4):277–8.
Munsky B, Fox Z, Neuert G. Integrating singlemolecule experiments and discrete stochastic models to understand heterogeneous gene transcription dynamics. Methods. 2015; 85:12–21.
Sepulveda LA, Xu H, Zhang J, Wang M, Golding I. Measurement of gene regulation in individual cells reveals rapid switching between promoter states. Science. 2016; 351(6278):1218–22.
Munsky B, Khammash M. The finite state projection algorithm for the solution of the chemical master equation. J Chem Phys. 2006; 124(4):044104.
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculations by fast computing machines. J Chem Phys. 1953; 21:1087–92.
Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970; 57:97–109.
Skinner SO, Xu H, NagarkarJaiswal S, Freire PR, Zwaka TP, Golding I. Singlecell analysis of transcription kinetics across the cell cycle. eLife. 2016; 5(12):7250–7.
McQuarrie DA. Stochastic approach to chemical kinetics. J Appl Probab. 1967; 4:413–78.
Gillespie DT. Exact stochastic simulation of coupled chemical reactions. J Phys Chem. 1977; 81(25):2340–61.
Lehoucq RB, Sorensen DC. DeflationTechniques for an implicitly restarted Arnoldi iteration. SIAM J Matrix Anal Appl. 1996; 17:789–821.
Struhl K. Transcriptional noise and the fidelity of initiation by RNA polymerase II. Nat Struct Mol Biol. 2007; 14(2):103–5.
Schwarz G. Estimating the dimension of a model. Ann Stat. 1978; 6(2):461–4.
Akaike H. Information theory and an extension of the maximum likelihood principle In: Parzen E, Tanabe K, Kitagawa G, editors. Selected papers of, Hirotugu Akaike. New York: Springer New York: 1998. p. 199–213.
Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit. J R Stat Soc Ser B Stat Methodol. 2002; 64(4):583–639.
Speckmann T, Sabatini PV, Nian C, Smith RG, Lynn FC. Npas4 transcription factor expression is regulated by calcium signaling pathways and prevents tacrolimusinduced cytotoxicity in pancreatic beta cells. J Biol Chem. 2016; 291(6):2682–95.
Bhatt DM, PandyaJones A, Tong AJ, Barozzi I, Lissner MM, Natoli G, et al.Transcript dynamics of proinflammatory genes revealed by sequence analysis of subcellular RNA fractions. Cell. 2012; 150(2):279–90.
Becskei A, Kaufmann BB, van Oudenaarden A. Contributions of low molecule number and chromosomal positioning to stochastic gene expression. Nat Genet. 2005; 37(9):937–44.
Larson DR, Fritzsch C, Sun L, Meng X, Lawrence DS, Singer RH. Direct observation of frequency modulated transcription in single cells using light activation. eLife. 2013; 2(2):1–20.
McDowell KA, Hutchinson AN, WongGoodrich SJ, Presby MM, Su D, Rodriguiz RM, et al.Reduced cortical BDNF expression and aberrant memory in Carf knockout mice. J Neurosci. 2010; 30(22):7453–65.
Lyons MR, Chen LF, Deng JV, Finn C, Pfenning AR, Sabhlok A, et al.The transcription factor calciumresponse factor limits NMDA receptordependent transcription in the developing brain. J Neurochem. 2016; 137(2):164–76.
Lin Y, Bloodgood BL, Hauser JL, Lapan AD, Koon AC, Kim TK, et al.Activitydependent regulation of inhibitory synapse development by Npas4. Nature. 2008; 455(7217):1198–204.
Levesque MJ, Raj A. Singlechromosome transcriptional profiling reveals chromosomal gene expression regulation. Nat Methods. 2013; 10(3):246–8.
Acknowledgments
We are grateful to Sayan Mukherjee and Stefano Di Talia for advice and feedback.
Funding
This work was supported by a CONACYT graduate fellowship (MGS), the National Institutes of Health Exploratory/Developmental Research Grant Award R21DA041878 (AEW), the National Institutes of Health Director’s New Innovator Award DP2 OD00865401 (NEB), the Burroughs Wellcome Fund CASI Award BWF 1005769.01 (NEB), and seed funding from the Duke Center for Genomic & Computational Biology (AEW and NEB).
Ethics approval
All experiments were conducted in accordance with an animal protocol approved by the Duke University Institutional Animal Care and Use Committee.
Author information
Authors and Affiliations
Contributions
MGS developed the software and analyzed the data. LFC performed the smFISH experiments and analyzed the data. AEW and NEB conceived the project and supervised the research. MGS and NEB wrote the manuscript. All authors read, edited, and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Supplementary figures S1–S3. (PDF 233 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
GómezSchiavon, M., Chen, LF., West, A. et al. BayFish: Bayesian inference of transcription dynamics from population snapshots of singlemolecule RNA FISH in single cells. Genome Biol 18, 164 (2017). https://doi.org/10.1186/s1305901712979
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1305901712979
Keywords
 Gene expression
 Stochastic process
 Chemical master equation
 Likelihood methods
 Monte Carlo sampling
 Bayesian posterior probability