Inferring steady state single-cell gene expression distributions from analysis of mesoscopic samples
© Mar et al.; licensee BioMed Central Ltd. 2006
Received: 4 August 2006
Accepted: 14 December 2006
Published: 14 December 2006
A great deal of interest has been generated by systems biology approaches that attempt to develop quantitative, predictive models of cellular processes. However, the starting point for all cellular gene expression, the transcription of RNA, has not been described and measured in a population of living cells.
Here we present a simple model for transcript levels based on Poisson statistics and provide supporting experimental evidence for genes known to be expressed at high, moderate, and low levels.
Although the model describes a microscopic process occurring at the level of an individual cell, the supporting data we provide come from small numbers of cells, where the echoes of the underlying stochastic processes can still be seen. Not only do these data confirm our model, but this general strategy opens up a potential new approach, Mesoscopic Biology, that can be used to assess the natural variability of processes occurring at the cellular level in biological systems.
In the study of biological processes, most of our observations are based on measurements made on a macroscopic scale, such as a piece of tissue or the collection of cells in a tissue culture dish, while the processes themselves are driven by events that occur at a microscopic scale representing events within each individual cell. The paradox here is that, macroscopically, biological processes often seem deterministic and are driven by what we observe as the average behaviour of millions of cells, but microscopically we expect the biology, driven by molecules that have to come together and interact in a complex environment, to have a stochastic component. Indeed, studies of transcriptional regulation at the single cell level have uncovered examples of non-uniform behaviour of gene expression in genetically identical cells. Levsky et al. [1] were among the first to profile gene expression levels in single cells and their results provided direct evidence of variable expression patterns in otherwise identical cells. Ozbudak et al. [2] quantified the direct effect that fluctuations in molecular species had on the variation of gene expression levels in isogenic cells. By independently modifying transcription and translation rates of a single fluorescent reporter protein, they were able to observe the downstream effects this had on protein expression. From these experiments, the authors were able to conclude that protein production occurs in sharp, random bursts. This was further explored by Cai et al. [3], who developed a microfluidic-based assay to observe proteins being produced in real-time inside a living cell. They provide experimental proof that proteins are expressed in bursts and demonstrate that the number of molecules per burst follows an exponential distribution.
While this represents an important advance, the mechanisms governing this behaviour are not yet fully known and building relevant models requires some knowledge of each of the basic processes involved in the pathway from DNA to RNA to protein.
Over the past 30 years, numerous mathematical models of stochastic gene expression have been proposed [4, 5]. Rao et al. [6] outline some of the most general of these approaches and show how they have been improved into more sophisticated models by various researchers. One of the most basic models is a stochastic differential equation that monitors the production rate of a molecular species (DNA, RNA or protein). This is simply a differential equation with a random noise term and a stochastic process or random variable that accounts for the amount of molecule available at a given time. Such models representing components of a particular system are then mathematically coupled to predict the output levels of genes, mRNAs, and proteins produced inside a single cell. A basic question that remains to be fully explored, however, is whether evidence of these stochastic elements exists and whether gene expression is truly a stochastic process. With respect to RNA, the answers to these questions have, thus far, been elusive. The problem is that nearly all RNA expression data come from large samples where the observed gene expression levels are an ensemble average over millions of cells. However, what we ultimately want to understand is the distribution of RNA levels in individual cells, something that has been difficult to measure. Here we propose a simple but elegant solution to this problem, which we refer to as 'Mesoscopic Biology'. In this approach, we conduct experiments between the microscopic and macroscopic levels, working with a small but finite number of cells where measurements can be easily made but where evidence of stochastic processes operating at a cellular level is not lost through the biological averaging that occurs in large samples.
To illustrate the power of the mesoscopic approach, we show for the first time that RNA transcript levels obey Poisson statistics for genes expressed at various levels within the cell. We begin by modelling mRNA copy number within a cell as a Poisson random variable and derive an analytical solution that captures the randomness in gene expression, manifested as an increase in measured biological variability as we decrease the number of cells assayed in a particular experiment. Using a dilution series experiment and measuring the expression of nine genes using quantitative real-time RT-PCR (qRT-PCR), we validate the model and provide estimates of the average expression level for each.
Results and discussion
The Poisson distribution is a mathematical function that assigns a probability to measuring a certain number of events within a defined time frame. Unlike the Normal or Gaussian distribution - the familiar 'bell curve' - which is centered symmetrically about its mean, the Poisson distribution is skewed to the right and places its probability 'mass' only on the non-negative integers (0, 1, 2, ...).
Poisson statistics have a long history of being used to model count data and counting processes [7] where there is a fixed lower limit in the count (zero). Consequently, a natural assumption is that the number of mRNA copies inside a single cell follows a Poisson distribution. If we view a whole tissue as being made up of N cells of the same type, then the corresponding expression levels for each gene, represented as the number of mRNA copies in each cell, can be cast as a sample of N independent, identically distributed Poisson random variables; note this is a simplifying assumption that we have made for the purposes of modelling mRNA counts. Assigning a probability distribution function to mRNA copy numbers allows us to capture the stochastic nature of the underlying transcriptional process while providing a means to estimate overall properties and to make inferential statements about how these properties behave as we change the number of cells under analysis. In particular, such a statistical model allows us to estimate parameters, such as the average copy number per cell for each gene-specific transcript. Moreover, we expect the average gene expression to behave like a Normal random variable as the size of the biological sample (that is, the number of cells, N) grows. This result follows from the Central Limit theorem and gives us a way to derive analytical statements about how the variability in gene expression will change with sample size.
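Specifically, suppose that each cell makes, on average, a certain number of copies (say λ) of a particular gene. In this case, the probability that a cell produces exactly x copies of a gene is given by the standard form of the Poisson probability distribution:

$$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \ldots$$

where X denotes the number of copies in a single cell.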
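As a small numerical illustration of the Poisson probability just described, the following Python sketch tabulates the probability of observing 0 to 10 copies in a cell. The mean of 5 copies per cell and the function name `poisson_pmf` are assumptions chosen for illustration, not values or code from the study.

```python
import math


def poisson_pmf(x, lam):
    """Probability that a cell holds exactly x copies when the mean is lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


lam = 5.0  # hypothetical mean copy number per cell (illustrative only)
probs = [poisson_pmf(x, lam) for x in range(11)]
for x, p in enumerate(probs):
    print(f"P(X = {x:2d}) = {p:.4f}")
```

Even with a mean of 5 copies, a small fraction of cells is expected to contain no copies at all, while others contain roughly twice the mean.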
If we let $\bar{X}$ denote the average gene expression across the total cell population, then for a large number of cells N, the average gene expression follows a Normal distribution with mean λ and variance $\lambda/N$. This simple model lets us analytically infer how biological variability will behave within a population of N 'identical' cells and make predictions that can be experimentally verified. Note that in any measurement, there are systematic sources of error (or variability) and those that represent the true distribution of the quantity we measure within the population. Biological variability refers to the 'noise' or variability specific to the biological system under study. If we were somehow able to control for all types of experimental and technical noise in our measurements, the remaining variation would be a result of naturally occurring biological variability. The standard deviation of blood pressure measurements is an example of biological variability in a population of individuals. The variation in the number of transcripts in each cell is the biological variation we are trying to model.
Simulations: visualizing the model
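The prediction of the model, that the variance of the average expression per cell shrinks as λ/N, can be checked with a short simulation. The Python sketch below is an illustration only and is not the authors' simulation code: the mean of 5 copies per cell, the pool sizes, the number of replicate pools and all function names are assumptions. It draws Poisson counts with Knuth's multiplication method, which is adequate for small means.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count using Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def variance_of_pool_mean(lam, n_cells, n_pools, rng):
    """Variance, across replicate pools, of the mean copy number per cell."""
    means = []
    for _ in range(n_pools):
        total = sum(sample_poisson(lam, rng) for _ in range(n_cells))
        means.append(total / n_cells)
    return statistics.variance(means)


rng = random.Random(0)  # fixed seed so the run is repeatable
lam = 5.0               # hypothetical mean copies per cell
results = {}
for n_cells in (1, 10, 100):
    results[n_cells] = variance_of_pool_mean(lam, n_cells, 2000, rng)
    print(f"N = {n_cells:3d}: simulated variance = {results[n_cells]:.4f}, "
          f"lam / N = {lam / n_cells:.4f}")
```

The simulated variances should track λ/N closely, so that each tenfold increase in the number of cells per pool reduces the variance roughly tenfold.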
Genes featured in the validation experiment
Any measured value ultimately represents a convolution of the true signal and an error associated with the measuring process. For macroscopic samples, separating out these two sources is typically straightforward, especially in the presence of a strong and genuine signal and low relative levels of background noise. When working with small samples, however, these two sources are more tightly entwined and the de-convolution process is a more challenging exercise. In assessing gene expression measurements obtained using qRT-PCR, the most significant source of error is the Monte Carlo effect [9], which can produce anomalies due to differences in amplification efficiencies between individual RNA species, particularly when a complex RNA sample is being used. In our analysis, the RNA dilution series was designed to allow us to estimate this effect, as each pool at a particular dilution level should have approximately the same transcript density as samples in the experimental tissue culture dilution series. When considering biological and experimental sources of variability, it is reasonable to assume that these sources are independent and that their variances are, therefore, additive. Hence we can estimate the biological variability of gene expression in our culture dilution by estimating the experimental variability from the RNA dilution series data and subtracting it from the variability of the culture dilution series data.
The raw qRT-PCR data were quantified using ABI Prism 7900HT SDS software (version 2.2.2, Applied Biosystems, Foster City, CA, USA). Estimates of experimental error at each dilution series step came from the within-sample variance of the gene expression measures (qRT-PCR quantification values) from the RNA dilution ($\hat{\sigma}^2_{\mathrm{RNA}}$). An estimate of the true biological variability ($\hat{\sigma}^2_{\mathrm{bio}}$) was obtained by taking the variance of the gene expression measures from the culture dilution ($\hat{\sigma}^2_{\mathrm{culture}}$) and subtracting $\hat{\sigma}^2_{\mathrm{RNA}}$, that is:

$$\hat{\sigma}^2_{\mathrm{bio}} = \hat{\sigma}^2_{\mathrm{culture}} - \hat{\sigma}^2_{\mathrm{RNA}}$$
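The variance subtraction described above can be sketched in a few lines of Python. The replicate values below are invented for illustration (nine per dilution level, matching the replicate count in the Materials and methods), and the function name `biological_variance` is an assumption rather than part of the authors' R code.

```python
import statistics


def biological_variance(culture_values, rna_values):
    """Culture-dilution variance minus RNA-dilution (experimental) variance."""
    culture_var = statistics.variance(culture_values)  # biological + experimental
    rna_var = statistics.variance(rna_values)          # experimental only
    return culture_var - rna_var


# Invented quantification values for one gene at one dilution level.
culture = [10, 14, 9, 15, 12, 8, 16, 11, 13]
rna = [12, 13, 11, 12, 12, 13, 11, 12, 12]

bio_var = biological_variance(culture, rna)
print(f"estimated biological variance: {bio_var:.2f}")
```

Because the estimate is a difference of two sample variances, it can come out negative when the experimental variance happens to exceed the culture variance; this sketch does not truncate such values.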
As we assume gene expression is Poisson, with mean λ, we can estimate the average expression per cell using simple linear regression, where the estimated biological variability is fit to a function of the form $\lambda/N + I$, where I represents a linear offset of the biological variability. We can interpret I as the value that, along with the estimate of λ, gives the approximate number of cells required in the assay for the biological variability effects to be negligible through the expression:

$$N_0 = -\frac{\lambda}{I}$$
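A minimal Python sketch of this regression step follows. It fits a straight line by ordinary least squares to synthetic variances, using the transformed covariate 1/log10(N) mentioned in the Materials and methods. The input variances are generated from made-up values of λ and I, and `fit_line` is a hypothetical helper, not the authors' R code. Reading the resulting threshold on the log10 scale of the covariate is our assumption.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Cell numbers in a four-step 1:10 dilution series.
cell_counts = [1e3, 1e4, 1e5, 1e6, 1e7]

# Synthetic biological variances built from made-up parameter values.
true_lam, true_intercept = 6.0, -1.0
x = [1.0 / math.log10(n) for n in cell_counts]
bio_var = [true_lam * xi + true_intercept for xi in x]

lam_hat, intercept_hat = fit_line(x, bio_var)
# Value of the covariate's denominator at which the fitted variance reaches zero
# (assumption: interpreted on the same log10 scale as the fitted covariate).
threshold = -lam_hat / intercept_hat
print(lam_hat, intercept_hat, threshold)
```

With a positive slope and a negative intercept, as in most rows of the parameter table, the ratio -λ/I is positive and marks where the fitted line crosses zero variance.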
Estimates of model parameters λ and I

λ                      I (intercept estimate)
6.802453 × 10^9        -1.208535 × 10^9
2.122443 × 10^8        -4.484740 × 10^7
1.370801 × 10^6        -2.441642 × 10^5
8.885468 × 10^3        -1.719060 × 10^3
4.000586 × 10^7        -7.723061 × 10^6
3.656176 × 10^6        -6.591370 × 10^5
-2.590916 × 10^10      -2.015157 × 10^9
1.464513 × 10^9        -2.127757 × 10^8
2.762874 × 10^7        -5.596301 × 10^6
Although evidence for stochastic processes in biology has been mounting for quite some time, there has only been a single published report of the variability of gene expression in single cells, which did not provide an underlying statistical model for mRNA representation within the cell . While this may seem to be minor, it represents a significant gap in our knowledge if we are to construct the sort of predictive models that are the aim of systems biology.
While we tend to think of a tissue sample as being homogeneous and to discuss levels of gene expression in terms of absolute numbers of copies per cell, our evidence indicates that gene expression levels obey simple and predictable Poisson statistics. When we imagine a gene expressed at 'five copies per cell', there clearly must be a range, with some cells expressing very few or no copies while others express the same gene at high levels and the Poisson distribution specifies the likelihood that any particular number of transcripts will be observed within a population of cells. In support of this proposed model, we provide experimental data that demonstrate precisely the behavior we predict for the variance as a function of the number of cells we sample. The evidence supporting this comes directly from sampling statistics: the variance in gene expression levels decays as 1/N, where N is the number of cells sampled. The beauty of this result is that it can be measured experimentally even for genes such as PIK3 that are expressed at very low levels and that such measurements can be used to estimate commonly quoted properties of the distribution, such as the average expression level. One caveat, of course, is that we are only observing steady state gene expression and have not taken into account the effects of cellular perturbations in which the overall patterns of expression may alter as cells begin transcriptional activity at different times so that the population average at any point may not appear Poisson. However, our results suggest that when 'bursts' of transcription (or translation) do occur, one must consider the probability distribution reflecting the number of molecules produced.
We also demonstrate something subtle but important: the effects of stochastic events occurring at a cellular level can be observed by looking at small but experimentally accessible numbers of cells. This suggests that other stochastic events occurring in single cells, even complex interactions in pathways, may reveal themselves through the analysis of samples of mesoscopic size. In many ways, this situation is analogous to one in statistical mechanics and thermodynamics. While we understand that the Ideal Gas Law describes gas dynamics for macroscopic samples, we know that, on a microscopic scale, the behavior of the gas molecules themselves is described by the Maxwell-Boltzmann distribution. But observing individual molecules is essentially impossible. The compromise is to look at small numbers of molecules - mesoscopic samples - where one can begin to see deviations from the ideal gas behavior. Our hope in presenting this work is to open the door to a new approach to the study of biological systems in which, working with small but tractable numbers of cells, we can begin to explore the stochastic components of cellular processes. Understanding these effects will be essential if we are to develop useful systems biology approaches that do more than model average behavior but instead provide insight into the processes that lead away from the average to the development of disease phenotypes.
Materials and methods
SW620 cell culture
Cells from the human colon cancer cell line SW620 (American Type Culture Collection) were seeded in 100 mm tissue culture dishes using Dulbecco's Modified Eagle's Medium supplemented with 10% fetal bovine serum and 1% penicillin/streptomycin. Cells were cultured to a confluence of 1.0 × 10^7 cells at 37°C and 5% CO2.
RNA was extracted and purified using the Versagene RNA Purification Kit (Gentra Systems, Minneapolis, MN, USA) and the Absolutely RNA Miniprep and Microprep kits (Stratagene, La Jolla, CA, USA) according to each manufacturer's instructions. After RNA extraction from 1 × 10^7 cells using the Versagene RNA Purification kit, the RNA was subjected to a series of four 1:10 dilutions to a final dilution equivalent to 1 × 10^3 cells, with 9 replicates at each RNA dilution level. With another tissue culture dish containing 1 × 10^7 cells, cells were removed from the monolayer and subjected to the same 1:10 dilution series prior to RNA extraction. After four dilutions, a final dilution of 1 × 10^3 cells was achieved, with 9 replicates at each cell dilution level. RNA was then extracted from each replicate in the dilution series using the Absolutely RNA Miniprep and Microprep kits.
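The arithmetic of the dilution series, and what the Poisson model predicts at each level, can be laid out as follows. This Python sketch is illustrative only: the mean of 5 copies per cell is a hypothetical value, and `dilution_series` is an assumed helper name.

```python
import math


def dilution_series(start_cells, fold, steps):
    """Cell-equivalents per sample before and after each successive dilution."""
    levels = [start_cells]
    for _ in range(steps):
        levels.append(levels[-1] // fold)
    return levels


lam = 5.0  # hypothetical mean copies per cell, not a value from the study
levels = dilution_series(10 ** 7, 10, 4)
for n_cells in levels:
    # Under the Poisson model the per-cell average has variance lam / N,
    # so its coefficient of variation is 1 / sqrt(lam * N).
    cv = 1.0 / math.sqrt(lam * n_cells)
    print(f"{n_cells:>10d} cells: predicted CV of the per-cell mean = {cv:.2e}")
```

Each 1:10 dilution step increases the predicted coefficient of variation by a factor of sqrt(10), which is why biological variability becomes easier to detect at the low end of the series.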
Affymetrix microarray analysis
RNA from SW620 cells was prepared, labeled, and hybridized in triplicate to the Affymetrix U133Plus2 GeneChip™ according to the manufacturer's instructions (Affymetrix, Santa Clara, CA, USA). Probe sets were retained only if they appeared in three replicate arrays; the retained probe sets were assigned expression measures using the robust multi-array statistic developed by Irizarry et al. [11]. Probe sets were matched using HUGO gene symbols. Genes were then sorted by expression values into low, medium and high expression groups based on quartiles (the lowest quartile was discarded). We selected candidate genes from these three groups based on information found in the literature. RT-PCR was performed on these genes to determine their expression levels, relative to each other. The final nine genes were selected to represent a reasonable degree of coverage across these three levels.
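The quartile-based grouping described above could look like the following Python sketch. The gene labels and expression values are invented, and the mapping of the three remaining quartiles onto 'low', 'medium' and 'high' is our assumption about how the groups were formed. The cut points come from `statistics.quantiles` with its default method.

```python
import statistics


def group_by_quartile(expression):
    """Split genes into low/medium/high groups, discarding the lowest quartile.

    expression: dict mapping a gene label to its summarized expression value.
    """
    q1, q2, q3 = statistics.quantiles(expression.values(), n=4)
    groups = {"low": [], "medium": [], "high": []}
    for gene, value in expression.items():
        if value < q1:
            continue  # lowest quartile is discarded
        if value < q2:
            groups["low"].append(gene)
        elif value < q3:
            groups["medium"].append(gene)
        else:
            groups["high"].append(gene)
    return groups


# Invented expression summaries for eight hypothetical genes.
expression = {f"gene_{i}": float(i) for i in range(1, 9)}
groups = group_by_quartile(expression)
print(groups)
```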
Total RNA was extracted from cells according to the procedures described above. These RNA samples were then reverse transcribed to produce cDNA using reagents from the TaqMan reverse transcription kit (Applied Biosystems, Foster City, CA, USA) and then subjected to quantitative PCR using SYBR Green (Applied Biosystems). SYBR Green incorporation was detected in real time using the ABI Prism 7900HT system and expression was quantified using 18S ribosomal RNA (Ambion, Austin, TX, USA) as a standard curve for normalization. Forward and reverse primer pair sequences (Invitrogen, Carlsbad, CA, USA) used for RT-PCR were: ACTB, (GGACTTCGAGCAAGAGATGG, AGGAAGGAAGGCTGGAAGAG); ATP5L, (CAAGGTTGAGCTGGTTCCTC, CACCAAACCATTCAGCACAG); GAPDH, (GAGTCAACGGATTTGGTCGT, GATCTCGCTCCTGGAAGATG); GNAS, (TGAACGTGCCTGACTTTGAC, TCCACCTGGAACTTGGTCTC); DDR1, (AATGAGGACCCTGAGGGAGT, CCGTCATAGGTGGAGTCGTT); PIK3, (GAGGAGGTGCTGTGGAATGT, GAGGAGGTGCTGTGGAATGT); PNN, (AGCGCACACGTAGAGACCTT, CCGCTTTTGCCTTTCAGTAG); POLH, (ATGGGACCGTAACTCAGCAC, TCAGGCTTGCCTGTAGGATT); ZCCHC7, (GGACCCAGCGGTACTATTCA, GGCTGGACAGGAATACAGGA).
Single cell RT-PCR
SW620 human colon cancer cells were cultured according to the procedures described above and harvested at a confluence of 2.41 × 10^7 cells. Cells were then diluted in sterile water to a final concentration of 1 cell/μl. A 96-well plate, each well containing one cell, was placed in a thermal cycler at 95°C for two minutes to lyse the cells. DNase I was added to degrade DNA at 37°C for 1 hour. EDTA was added at a final concentration of 5 mM to protect the RNA, and the samples were then incubated at 75°C for 10 minutes to deactivate the DNase I. Resulting RNA from single cells was then subjected to RT-PCR according to the procedures described above. One 384-well plate was used, yielding 360 samples in total (remaining wells were devoted to obtaining measurements for standard curves and negative controls).
Figure 4 represents curves fitted using simple linear regression modeling of the empirical data. The covariate in the regression model N (representing the number of cells) has been log10-transformed.
Based on derivations from the theoretical model, we expect the empirical variances, as calculated from our experimental data, to behave according to $\sigma^2 = \lambda/N$, in other words, a decay following a 1/N relationship with some scaling factor λ involved. To estimate this scaling factor we fitted a simple linear regression, using the transformed covariate 1/N* (where N* = log10 N). We did not force the regression line to pass through the origin, and hence allowed for a non-zero intercept in our model, which we denote as I. To derive a reasonable interpretation for the intercept I, imagine that as the variance approaches zero:

$$I \approx -\frac{\lambda}{N}$$

An easier way to interpret this is with respect to N, and if we rearrange the previous equation we get:

$$N = -\frac{\lambda}{I}$$

and, since this relationship only holds for values of N when the variance approaches zero or negligible levels, we denote this equation as:

$$N_0 = -\frac{\lambda}{I}$$

to distinguish from all other values of N.
Poisson distribution analysis
Empirical evidence in support of the assumption that gene expression levels follow a Poisson distribution was strengthened by two simple statistical analyses. First, a histogram (Figure 4) of the gene expression levels obtained from the limiting dilution experiment for ACTB resembles the expected probability distribution function (the distribution is skewed to the right, with most values concentrated at the low end). Second, we constructed a quantile-quantile plot, comparing empirical quantiles based on the ACTB gene expression levels with theoretical quantiles expected for a Poisson distribution (with mean equal to the observed mean). Quantiles, like percentiles and quartiles, represent summary statistics of the data that help us gauge the spread of the distribution of data points. For instance, the 25th percentile is the value below which the lowest 25% of the data points fall. While percentiles are obtained by dividing the data into 100 sections, and quartiles represent divisions into 4, a quantile represents a generalized term for any division. Quartiles and percentiles are actually 4-quantiles and 100-quantiles, respectively. The idea behind the quantile-quantile plot is to compare how the data points are distributed (relative to each other) in the empirical sample (where the distribution is typically unknown) with a theoretical sample that has been simulated under a distributional assumption.
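The pairing of empirical and theoretical quantiles that underlies a quantile-quantile plot can be sketched in Python as below. The counts are invented, the plotting positions (i - 0.5)/n are one common convention chosen here as an assumption, and the function names are hypothetical. The theoretical quantiles are computed directly from the Poisson cumulative distribution rather than from a simulated sample.

```python
import math
import statistics


def poisson_quantile(p, lam):
    """Smallest count k whose cumulative Poisson probability is at least p."""
    k = 0
    term = math.exp(-lam)  # P(X = 0); assumes lam is moderate (no underflow)
    cdf = term
    while cdf < p:
        k += 1
        term *= lam / k    # P(X = k) from P(X = k - 1)
        cdf += term
    return k


def qq_pairs(data):
    """Pairs (theoretical Poisson quantile, empirical quantile) for a QQ plot."""
    lam = statistics.mean(data)  # Poisson mean set to the observed mean
    ordered = sorted(data)
    n = len(ordered)
    pairs = []
    for i, value in enumerate(ordered, start=1):
        p = (i - 0.5) / n        # plotting position for the i-th ordered value
        pairs.append((poisson_quantile(p, lam), value))
    return pairs


# Invented counts standing in for single-cell expression values.
data = [3, 5, 4, 6, 2, 5, 7, 4, 5, 6, 3, 8]
pairs = qq_pairs(data)
for theoretical, empirical in pairs:
    print(theoretical, empirical)
```

If the Poisson assumption is reasonable, the plotted pairs fall close to the line of equality; systematic departures at either end point to heavier or lighter tails than the model predicts.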
The majority of the data follow the Poisson assumption; some apparent deviation is likely to be a result of experimental artefacts. A two-component Poisson mixture model was fitted to the histogram of RT-PCR quantification values using a quasi-Newton method with constraints (via the optim function in R). The algorithm was terminated when the relative difference in the log-likelihood functions was less than 1.4901 × 10^-8.
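To make the idea of a two-component Poisson mixture concrete, the Python sketch below fits one by expectation-maximization (EM). This is a deliberately swapped-in technique: the analysis described above used a constrained quasi-Newton optimizer via R's optim, not EM. The data, starting values and function names are assumptions; only the relative log-likelihood stopping tolerance is taken from the text.

```python
import math


def log_poisson_pmf(x, lam):
    """Log of the Poisson probability of count x given mean lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def component_logs(x, weight, lam1, lam2):
    """Log of each weighted component density at x."""
    a = math.log(weight) + log_poisson_pmf(x, lam1)
    b = math.log(1.0 - weight) + log_poisson_pmf(x, lam2)
    return a, b


def log_likelihood(data, weight, lam1, lam2):
    total = 0.0
    for x in data:
        a, b = component_logs(x, weight, lam1, lam2)
        m = max(a, b)  # log-sum-exp for numerical stability
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total


def fit_mixture(data, lam1, lam2, weight=0.5, reltol=1.4901e-8, max_iter=1000):
    """EM fit of a two-component Poisson mixture; returns (weight, lam1, lam2)."""
    n = len(data)
    previous = log_likelihood(data, weight, lam1, lam2)
    for _ in range(max_iter):
        # E-step: probability that each value belongs to component 1.
        resp = []
        for x in data:
            a, b = component_logs(x, weight, lam1, lam2)
            m = max(a, b)
            ea, eb = math.exp(a - m), math.exp(b - m)
            resp.append(ea / (ea + eb))
        # M-step: update the mixing weight and the two means.
        n1 = min(max(sum(resp), 1e-12), n - 1e-12)  # keep weight inside (0, 1)
        weight = n1 / n
        lam1 = max(sum(r * x for r, x in zip(resp, data)) / n1, 1e-12)
        lam2 = max(sum((1 - r) * x for r, x in zip(resp, data)) / (n - n1), 1e-12)
        current = log_likelihood(data, weight, lam1, lam2)
        # Stop when the relative change in log-likelihood is below reltol.
        if abs(current - previous) <= reltol * (abs(previous) + reltol):
            break
        previous = current
    return weight, lam1, lam2


# Invented counts with a low-expression and a high-expression group.
data = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2, 20, 22, 19, 21, 18, 23, 20, 21]
weight, lam1, lam2 = fit_mixture(data, lam1=1.0, lam2=15.0)
print(weight, lam1, lam2)
```

EM is used here because it needs no external optimizer and increases the log-likelihood at every step; a quasi-Newton fit of the same likelihood would be expected to reach a similar solution when the two components are well separated.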
Data and software availability
All data generated and analyzed in this manuscript as well as the R code used in the analysis and a tutorial outlining the various steps are available from [12] so that readers can reproduce our results and apply a similar analysis to their own datasets.
Additional data file
The following additional data are available with the online version of this paper. Additional data file 1 is a .zip file containing the qRT-PCR data analyzed in this manuscript, the software (as R code) used to perform the analysis and produce the figures presented, and instructions on how to install R and perform the analysis as well as a "README" that explicitly describes each file in the .zip archive.
The authors would like to thank Aedin Culhane for assistance with the analysis of DNA microarray data to identify candidate genes used in this study and for truly invaluable discussions. This work was supported by funds provided by the Dana-Farber Cancer Institute and its strategic fund.
1. Levsky JM, Shenoy SM, Pezo RC, Singer RH: Single-cell gene expression profiling. Science. 2002, 297: 836-840. 10.1126/science.1072241.
2. Ozbudak EM, Thattai M, Kurtser I, Grossman AD, van Oudenaarden A: Regulation of noise in the expression of a single gene. Nat Genet. 2002, 31: 69-73. 10.1038/ng869.
3. Cai L, Friedman N, Xie XS: Stochastic protein expression in individual cells at the single molecule level. Nature. 2006, 440: 358-362. 10.1038/nature04599.
4. Kaern M, Elston TC, Blake WJ, Collins JJ: Stochasticity in gene expression: from theories to phenotypes. Nat Rev Genet. 2005, 6: 451-464. 10.1038/nrg1615.
5. Paulsson J: Models of stochastic gene expression. Physics Life Reviews. 2005, 2: 157-175. 10.1016/j.plrev.2005.03.003.
6. Rao CV, Wolf DM, Arkin AP: Control, exploitation and tolerance of intracellular noise. Nature. 2002, 420: 231-237. 10.1038/nature01258.
7. Casella G, Berger RL: Statistical Inference. 2001, Pacific Grove, CA: Duxbury Press, 2nd edition.
8. Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol. 2000, 132: 365-386.
9. Bustin SA, Nolan T: Pitfalls of quantitative real-time reverse-transcription polymerase chain reaction. J Biomol Tech. 2004, 15: 155-166.
10. Dempster AP, Laird NM, Rubin DB: Maximum likelihood estimation from incomplete data via the EM algorithm. J Royal Statist Soc B. 1977, 39: 1-38.
11. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-264. 10.1093/biostatistics/4.2.249.
12. Supplemental Data. [http://compbio.dfci.harvard.edu/pubs/stochastic.zip]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.