- Open Access
CpG island composition differences are a source of gene expression noise indicative of promoter responsiveness
© The Author(s) 2018
- Received: 26 March 2018
- Accepted: 4 June 2018
- Published: 26 June 2018
Population phenotypic variation can arise from genetic differences between individuals, or from cellular heterogeneity in an isogenic group of cells or organisms. The emergence of gene expression differences between genetically identical cells is referred to as gene expression noise, the sources of which are not well understood.
In this work, by studying gene expression noise between multiple cell lineages and mammalian species, we find consistent evidence of a role for CpG islands as sources of gene expression noise. Variation in noise among CpG island promoters can be partially attributed to differences in island size, in which short islands have noisier gene expression. Building on these findings, we investigate the potential for short CpG islands to act as fast response elements to environmental stimuli. Specifically, we find that these islands are enriched amongst primary response genes in SWI/SNF-independent stimuli, suggesting that expression noise is an indicator of promoter responsiveness.
Thus, through the integration of single-cell RNA expression profiling, chromatin landscape and temporal gene expression dynamics, we have uncovered a role for short CpG island promoters as fast response elements.
- Gene expression noise
- Single cell
- Promoter response
Variability in gene expression across an isogenic population of cells has garnered significant interest over the past decade and a half. Initial studies of variability in gene expression, henceforth referred to as gene expression noise, from single promoters [1, 2] and simple gene regulatory circuits  have motivated work in pro- and eukaryote systems to characterise and investigate both the sources and consequences of this noise. The extent of noise within vertebrate, and more specifically mammalian, systems has received less attention . In part this is due to the simplicity and elegance of organisms that exist primarily in a single-cell state for much of their natural lives, but also the ease with which these organisms and systems can be manipulated to alter gene expression noise. This is especially important for understanding the drivers of expression noise, as well as its consequences at the biochemical, cellular and physiological levels.
In yeast, the presence of a TATA-box motif in the core promoter is linked with greater noise, which is associated with differences in nucleosome occupancy and pre-initiation complex dynamics [5, 6]. Examination of mammalian promoter features that influence individual-to-individual variability have highlighted a number of chromatin modifications linked to differential variability . Likewise, the integration of single-cell expression profiling with bulk chromatin modification data in embryonic cells has highlighted a role for differences in chromatin modifications between genes as a possible source of gene expression noise [8–10]. Specifically, both Kar et al.  and Faure et al.  show that promoter chromatin modifications are related to gene expression noise differences between promoters, whilst Wu et al. find that gene body chromatin methylation, specifically H3K79me3, is associated with differential noise . Additionally, both  and  demonstrate that promoters that are simultaneously marked with opposing chromatin states (repressive H3K27me3 and activation H3K4me3), so-called bivalent promoters , are associated with greater gene expression noise in bulk populations of embryonic stem cells. This may arise due to heterogeneity in the chromatin state across the populations of cells studied or due to opposing chromatin modification-associated activity at these promoters .
We hypothesise that there are a number of promoter features that may influence the expression noise of a particular gene. To understand the shared features of promoters, we investigated potential sources and causes of gene expression noise at a genome-wide scale. We find that CpG island promoters are associated with less gene expression noise than their non-CpG island counterparts, and that the characteristic features of CpG islands, e.g. polycomb repressor complex (PRC) and trithorax group (TrxG) chromatin modifications and CpG island size, contribute to differential noise between promoters. Within CpG island promoters, there remain extensive differences in gene expression noise, which is anti-correlated with CpG island size. We corroborate recent findings showing that bivalent promoters in mouse embryonic stem cells (mESCs) are the noisiest promoters [8, 10]. Expanding on the work by Faure et al. , we propose that the increased noise from short CpG islands is an indication of promoter dynamics. Specifically, we investigate whether short CpG islands act as agile response elements for switch/sucrose non-fermentable complex (SWI/SNF)-independent primary response genes. Analysis of time-series expression profiles from human and mouse studies suggests that under different stimuli, genes with short CpG islands respond earlier than those with longer CpG islands and that highly variable genes in unstimulated cells form part of the early response.
Genomic sources of gene expression noise
We considered different categories of genomic features involved in gene expression regulation, to capture generic features of gene promoter architecture that influence expression dynamics (Additional file 1: Table S1). These features can be sub-divided into static and dynamic types. Static features are invariant between cells, whilst dynamic features can vary between cells depending upon their lineage, type or state.
Using single-cell RNA-seq measurements from mESCs , we first explored which genomic features could underpin differences in (mean independent) expression noise across cells. Using the approach described (‘Methods’; Fig. 1a), we evaluated the effect of each promoter feature on gene expression noise (Fig. 1b–d). This included a multivariate robust linear model with all of the genomic features to test for linear independence between possible related features (Fig. 1b). To replicate the observed effects independently and to test their generalisability, we performed the same analysis using data derived from mouse Cd4+ naive T cells  (Additional file 2: Figure S1).
Unlike previous reports on Saccharomyces cerevisiae and mESCs, we do not observe a consistent correlation between predicted TATA box binding protein (TBP) motifs and differences in expression noise [10, 15] (Fig. 1b). In this study, we consider that the promoter encompasses a 1.5-kb region, whilst previous studies on TATA boxes and TBP binding have restricted their analysis to core promoter regions (∼200 bp) centred on the transcriptional start site. Using the same definition of TATA-box promoters as in [10, 16], we find that TATA-box promoters are associated with greater gene expression noise in our univariate, but not the multivariate, robust regression model (Additional file 2: Figure S2). Thus, this discrepancy arises due to differences between relying solely on predicted TBP motifs and more comprehensive promoter classifications, rather than the size of the promoter region per se.
We find in the univariate case that gene structure (i.e. transcript length, number of exons and mean exon length) has a relatively large influence on noise (Fig. 1b, circles). With the exception of mean exon length, these effects are consistently captured by other variables related to gene structure in both mESCs and Cd4+ T cells. Interestingly, we find that promoters with an overlapping CpG island are on average less variable than their non-CpG island counterparts (Fig. 1 and Additional file 2: Figure S1), concordant with a recent report by Faure et al. .
As we wish to understand the general features of mammalian promoters that influence their noise, we extended our analysis to several human cell types (Additional file 3: Table S2). In accordance with our observations in mouse, we observe that CpG islands are consistently associated with lower gene expression noise (Fig. 1d). The extent to which CpG islands are correlated with gene expression noise varies between cell types and between species. This may represent biological differences between developmental and evolutionary lineages or technical and experimental differences between studies. The data sets used in our analysis are all generated using the SMART-seq(2) chemistry [17, 18], and thus, may be susceptible to technical noise arising from fragment duplication. To test whether our results are affected by this potential bias, we also performed the same analysis using single-cell expression profiles from mESCs cultured in serum + leukaemia inhibitory factor, generated using unique molecular identifiers . We find that CpG islands remain associated with lower expression noise, suggesting that this correlation does not arise due to shared technical sources of variation in single-cell RNA-seq experiments (Additional file 2: Figure S3). Subsequently, we can confidently conclude that the relationship between differential noise and CpG island and non-CpG island promoters is a feature of mammalian genomes separated by ∼80 million years of evolution.
Characteristics of CpG islands associated with expression noise
Although genes with CpG island promoters are systematically less noisy than genes without a CpG island, there is still considerable variability in expression levels among CpG island genes (Fig. 1c, black outlier points). This raises the question of whether the characteristics of specific CpG islands also contribute to gene expression noise, which to our knowledge has not been previously addressed. We selected features of CpG islands to test for association with gene expression noise, such as CpG island size and the number of predicted SP1 binding motifs. Several features of CpG islands are highly correlated, e.g. CpG island size and CpG dinucleotide ratio (Additional file 2: Figure S4). For this reason, we selected CpG island size as a characteristic measure of CpG islands, as it has a more intuitive interpretation and has been linked with potential functional differences between genes .
The epigenetic landscape is correlated with expression noise
We wish to understand the mechanism that leads to differences in gene expression noise among CpG islands. The principal difference between CpG island and non-CpG island promoters is how their transcription is repressed or modulated. Non-CpG island promoters maintain transcription repression by cytosine methylation at CpG dinucleotides , though recent evidence demonstrates that CpG methylation alone cannot induce repression . CpG islands are constitutively unmethylated regions, despite a high CpG density that would attract active DNA methyltransferases . These unmethylated CpG dinucleotides provide a platform for binding by proteins that form part of the PRC and TrxG, via their CxxC domains [23, 24]. Thus, transcription from CpG island promoters is primarily regulated by chromatin histone modifications, in the absence of direct modification of DNA, such as cytosine methylation.
To understand how CpG islands contribute to noise in gene expression, we combined chromatin histone modification data associated with transcriptional activation and repression (H3K4me3 and H3K27me3 ChIP-seq; see ‘Methods’) on bulk mESCs with single-cell RNA-seq expression profiling. Previous work has highlighted the enrichment of genes involved in developmental processes amongst promoters with long CpG islands [20, 25]. One explanation for the observed relationship between CpG island size and decreasing expression noise (Fig. 2a), therefore, might be that developmental genes need to have more highly regulated expression. However, after removing >900 developmentally-associated genes, we find that the anti-correlation between CpG island features and expression noise remains, suggesting there is an alternative explanation (Additional file 2: Figure S6).
To investigate the potential for bivalency to explain the correlation between gene expression noise and CGI size, we categorised promoters as repressed, active or bivalent based on the combined H3K27me3 and H3K4me3 ChIP signal (Additional file 2: Figure S7). Consistent with recent reports [8, 10], our analysis reveals that bivalent promoters are highly variable, an effect that is not dependent on CpG island size (Fig. 3c).
Short CpG islands provide response agility to stimulus
Our observations that CpG island size is associated with expression noise is not explained by the tendency for developmental genes to be associated with longer CpG islands (Additional file 2: Figure S6), nor by any correlation between CpG island size and bivalency. The necessity for tight regulation (i.e. a larger buffer against inappropriate stochastic activation) for genes involved in coordinated developmental programmes is well described. Whilst the gene expression noise of individual pluripotency and differentiation factors is associated with cellular plasticity and cell fate choice [26, 27], the execution of any particular developmental programme is highly coordinated and consistent, e.g. gastrulation, limb bud formation, etc. Mammalian CpG islands range in size from ∼200 bp to 10’s of kilobases and CpG island size is proportional between human and mouse orthologous genes (Spearman rank correlation = 0.51; Additional file 2: Figure S8). Whilst large islands may lead to tighter transcriptional regulation, it is not immediately clear why there is also concordance in the size of short CpG islands between homologous mouse and human genes.
Whilst we consider gene expression noise to arise due to influences from both cis and trans factors, the utility of gene expression noise remains unresolved. Previous work has described the benefits of gene expression noise in eukaryote systems , but often in relation to a single gene  or small gene regulatory network. If fluctuations in gene expression (i.e. noise) are detrimental to fitness, then buffering mechanisms may have evolved to reduce the impact of expression noise  or there could be some degree of tolerance to noise due to a cost/benefit trade-off. Thus, is the observed greater expression noise in small CpG islands indicative of a cost/benefit trade-off?
Whilst we see consistent enrichment of short CpG islands amongst early response genes across a number of different cell types and stimuli, we also observed several exceptions to this pattern. For example, oestrogen receptor α (ER α) stimulated breast adenocarcinoma cells, and retinal pigment epithelial cells induced to an epithelial-to-mesenchymal transition with TGF β + TNF α displayed the opposite pattern of enrichment, i.e. longer CpG islands are enriched in early response genes (Fig. 4b). This opposing pattern of enrichment was also observed in mouse Cd8+ cytotoxic T cells stimulated with interferon β (IFN- β), amongst others (Additional file 2: Figure S10 and Additional file 4: Table S3). That we observe opposing patterns of enrichment within the same cell type but using different stimuli, i.e. MCF7 cells, indicates that there is a stimulus-specific mechanism. Previous work on stimuli-specific differences in IEG induction in mouse macrophages found a bias towards a SWI/SNF complex-dependence for IFN- β but not LPS . The SWI/SNF complex in both mammals and yeast plays a role as an ATP-dependent nucleosome re-modeller [32, 33]. Steroid-hormone nuclear receptors, including ER α, require SWI/SNF complex proteins in the induction of target genes [34–37]. Likewise, induction of TGF- β responsive genes via Smad 2/3 in epithelial cells is dependent on the SWI/SNF ATPase BRG1 [38, 39]. Thus, the SWI/SNF-(in)dependence between different stimuli reconciles the discordance in CpG island size usage between stimuli, even within cells of the same type and origin (Fig. 4).
If short CpG islands are fast response elements and gene expression noise diminishes with CpG island size, we would expect to observe that both short and noisy promoters are up-regulated immediately post-stimulation. To corroborate our hypothesis that short and noisy CpG island promoters are responsive promoters, we used single-cell RNA-seq expression data from stimulated BMDCs . Genes that are highly variable in unstimulated cells should be those genes most likely to respond. We performed differential expression testing between unstimulated BMDCs and those cells 1 hour post-LPS stimulation. Comparing genes that are up-regulated with their noise in unstimulated cells revealed that these genes are consistently more variable than those genes that are either down-regulated (one-tailed Kolmogorov–Smirnov test p=7.89×10−123) or for which expression does not change (one-tailed Kolmogorov–Smirnov test p=6.97×10−10). This effect is consistent for CpG island promoters up to 1.5 kb, but was still dependent on CpG island size (Fig. 4c). This relationship was maintained after adjusting for the mean dependence using rCV2 (Additional file 2: Figure S11b). Notably, we did not observe the same tendency for noisy genes with non-CpG island promoters to be immediately up-regulated (Additional file 2: Figure S11).
In this study, we investigated a number of promoter features and their association with differences in gene expression noise. In concordance with a recent systematic analysis in mESCs , we find that CpG island promoters are less variable than their non-CpG island counterparts. This observation, and its generalisability to different cell types and mammalian species, implicates transcription regulatory mechanisms as important sources of expression noise [7–9, 41]. Further, the defining characteristics of CpG islands are associated with differences in noise. In particular, our finding that CpG island size is negatively correlated with expression noise indicates that variation in CpG island composition explains a proportion of the differential noise between promoters.
Based on these observations, we propose that short CpG islands, as noisy promoters, may be indicative of transcriptional dynamism. Our observation of short CpG island enrichment amongst early response genes indicates the potential role for promoter-associated short CpG islands to act as rapid transcriptional response elements. These findings, supported by evidence that highly variable genes in unstimulated BMDCs are up-regulated immediately upon stimulation, suggest a potential cost/benefit trade-off between stochastic transcriptional activation (noise) and how quickly a promoter is able to respond to an external stimulus. Interestingly, Antolovic et al.  recently showed that noisier genes in undifferentiated Dictyostelium were more likely to be up-regulated upon induction of a differentiation signal. These findings in a less complex eukaryote organism, which are concordant with our observations in BMDCs, highlight the potential for gene expression noise to mark genes primed to respond as a general feature of transcriptional regulation. Dictyostelium species lack experimentally validated CpG islands, which suggests that the exact mechanism by which gene expression noise arises may differ between phylogenetic clades or species. This is important, as we do not observe the same responsiveness for noisy non-CGI promoters (Additional file 2: Figure S11c). Transcriptional dynamics and promoter sensitivity to a modulating stimulus (repressive or activatory) are influenced by the chromatin landscape [43, 44] and the presence of paused RNA polymerase [45, 46]. The class of genes associated with immediate response (IEGs) are rapidly induced within a few minutes of stimulation, without the need for prior protein synthesis . IEGs are associated with specific promoter architectures (high affinity TATA boxes and CpG islands) and encode shorter mRNA transcripts . Induction of IEGs is associated with post-translational modification of histone proteins, such as lysine acetylation and histone H3 serine phosphorylation . These modifications facilitate the binding of 14-3-3 and the SWI/SNF ATPase BRG1 . Our observation that specific stimuli do not appear to utilise short CpG islands as IEGs is potentially reconciled by the stimulus-specific recruitment of SWI/SNF complex proteins to these IEGs [34–37, 44]. Moreover, the current lack of data on SWI/SNF complex dynamics in early response to stimuli suggests that there is the potential to discover the molecular mechanisms underlying these observations. Thus, our observation that short CpG islands are enriched amongst early response genes indicates a potentially novel mechanism for mammalian IEG induction.
Whilst we use CpG island size as a definition for these rapid response elements, there are most likely additional influences from static and dynamic promoter features. For example, recent work in Drosophila has highlighted the importance of promoter shape in transcriptional dynamics and the importance of noise . CpG islands have been proposed as the vertebrate equivalent of polycomb response elements (PREs) in Drosophila, which is supported by evidence of PRC chromatin modification and PcG binding enrichment over these genomic features (reviewed in ). Thus, the link between CpG island size and response agility is potentially linked to differential dynamics in chromatin histone modifications or the binding of histone modification readers and writers.
Modelling of chromatin dynamics suggests that slow changes in chromatin modifications are required to induce transcriptional changes , a finding supported by the response model put forward by Klose et al. . Berry et al. model the impact of robustness to trans-activation, supportive of a buffer in the responsive model, and note that the width of the cis memory window can have a drastic impact on responsiveness to noise in trans activator levels . Thus, modelling of chromatin dynamics is concordant with the following: (a) CpG island size influences expression noise and (b) the smallest CpG islands provide the least buffering against stochastic fluctuations in trans activator levels.
The exact nature and source of gene expression noise within short CpG island promoters is not immediately clear. Recent evidence indicates that one common emerging source of gene expression noise is related to chromatin accessibility and dynamics [8–10]. It has been proposed that CpG islands generally have a more open or accessible conformation based on their lower affinity for histone proteins [31, 51]. However, more accessible chromatin is also associated with more promiscuous expression, which would not explain the increased noise observed at short CpG islands. Whether differences in noise arise due to the presence of paused or actively transcribing RNA polymerases, or the constitutive presence of trans activating factors requires further study.
Note that our measures of single-cell noise capture both technical and biological sources of variability. We do not expect technical sources of variation to distort systematically relationships between promoter architecture and gene expression noise, across multiple single-cell RNA-seq data sets. One possible source of confounding between our genomic features and noise is the sequence content of genic regions (e.g. coding sequence G + C content). However, exon G+C content is not related to CpG island composition or genomic position, indicating that our findings cannot be explained by technical sources of variation.
In this work, we used a mean-adjusted measure of noise, rCV2. We derive this measure by fitting a reciprocal relationship between CV2 and the mean expression for all genes, using a gamma distributed generalised linear model . The extent to which this regression removes any relationship between mean expression and the features and contexts in which we test gene expression noise is dependent on the quality of this model fit. CV2 is a real-valued positive variable whose sampling distribution is likely to be similar to other variance-like statistics that are usually described by a χ2 distribution. The χ2 distribution is a special case of a gamma distribution; thus, CV2 can be appropriately described by a gamma distribution. Therefore, we suggest that modelling the relationship between the mean and gene expression noise using a gamma generalised linear model provides the most appropriate model in this context.
Our findings indicate that noisy genes, independent of mean expression, tend to be the most rapidly up-regulated. This suggests that many genes are poised to react to environmental stimuli which may have consequences for understanding how, as well as the speed with which a cell is primed to react to its environment. For instance, stress response genes have been identified as particularly noisy [9, 52] and the expression of stress response genes diverges between yeast species . This indicates that highly responsive genes might evolve at different rates compared to more stably expressed and less noisy genes. Indeed, mutation accumulation experiments in yeast identified a correlation between the degree to which a gene promoter evolves and expression noise . Fast-evolving genes, i.e. those whose expression is not constrained by stabilising selection, may also represent a mechanism for generating phenotypic heterogeneity in a population [53, 55]. This may further provide a way to either promote or buffer against multiple different perturbations, e.g. environmental or mutational. Whether these principles generalise to more complex eukaryote organisms remains to be seen, and provides an exciting possible avenue of research.
In conclusion, we have shown that short and noisy CpG islands may act as rapid response elements to external stimuli. These findings raise interesting questions about what role transcriptional variation has to play in the cellular and physiological response of organisms to their environment, and how these mechanisms have evolved.
Single-cell RNA-seq data processing
Where available, gene-by-cell-expression count matrices were downloaded from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) or ArrayExpress (https://www.ebi.ac.uk/arrayexpress/; see Additional file 3: Table S2 for a list of accessions) [14, 19, 40, 56, 57]. Cells with low count numbers (<100 000) or that had been flagged as poor quality by the study meta-data (where available) were removed prior to normalisation. Within each data set separately, genes that were expressed in <1% of cells were removed, prior to cell size factor normalisation using the deconvolution approach  and log2 transformation.
Gene expression noise
To test the influence of genomic features on gene expression noise formally, we fitted a linear model to rCV2 in each data set and tested the slope of the fitted line, i.e. the regression coefficients. We observed that rCV2 had a long-tailed distribution (Additional file 2: Figure S1c). Ordinary least squares regression compares the mean difference between values of a predictor variable on the response variable. The breakdown point of the mean is 0; thus, only a single extreme value is required to bias its estimate. The median has a breakdown of 0.5, that is, it requires more than half of the values to be outliers to bias its estimation. Thus, ordinary least squares regression would be inappropriate in this context where there are many extreme rCV2 values, whilst a robust linear regression that uses the median would not be susceptible to the same extreme outlier values.
Gene promoters were defined for each mm10 and hg19 gene based on the Ensembl build v86 annotations (https://www.ensembl.org/index.html). Promoter regions encompassed −1 kb and + 500 bp centred on the transcriptional start site (or beginning of the first annotated exon), accounting for strand. Total transcript lengths, exon length variation and number of exons were calculated from the Ensembl v86 General Transfer Format (GTF) file.
CpG island data were downloaded from the UCSC Genome browser  for hg19 and mm10 using the table browser tool . Promoter sequence GC content was calculated for each promoter using bed2fasta from CGAT tools .
Transcription factor binding motifs were predicted over the length of all promoters using the MEME suite tool FIMO, with motif positional weight matrices provided from the JASPAR CORE vertebrate TF motif database (2016) [62, 63].
ChIP sequencing data processing
ChIP-seq libraries derived from mESCs were downloaded from GEO accession GSE36114 . Sequences were downloaded from the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena). Known Illumina sequencing adaptors were removed using trimmomatic prior to alignment to mm10 using bwa-mem . Aligned sequences were quantified over non-overlapping 200-nt windows on the mm10 genome in all ChIP and input libraries. log2-fold enrichment was calculated between each chromatin modification ChIP library and its matched input for each window. Signals over promoter intervals were calculated as the average log2-fold enrichment across overlapping windows for each replicate, then averaged across all replicates for the respective chromatin modification immunoprecipitation.
Bivalent promoters were calculated based on overlapping H3K27me3 and H3K4me3 ChIP signals. The H3K4me3 chromatin modification signal was dichotimised based on the local minimum of the kernel density estimate over all gene promoters. The H3K37me3 ChIP signals displayed a long tail without any noticeable minimum in density. Thus, to dichotomise the H3K27me3 signal, a threshold was set as the mean value across all gene promoters. Gene promoters were assigned to either active (H3K4me3) or repressed (H3K27me3) if the promoter signal exceeded 1.5 times the threshold for the relevant chromatin modification. Genes that exceeded both thresholds were assigned to the bivalent category.
Defining mouse embryonic development genes
Mouse embryonic development genes for Additional file 2: Figure S6 were defined from J1 embryonic stem cells (ESCs) differentiated over 14 days and measured at 11 time points . Gene expression data were downloaded from the StemBase data base (April 2018, http://www.stembase.ca)  and were based on measurements with an Affymetrix Mouse Expression 430a array. Embryonic genes were defined based on log2-fold change >0 over the differentiation time course, tested in a generalised linear model with a moderated t-test , at a false discovery rate of 1%.
CAGE processing and time-series analysis
Peak annotations for a human time series from a cap analysis of gene expression with sequencing (CAGE-seq) were acquired from the FANTOM5 consortium website (http://fantom.gsc.riken.jp/5/data/) [69, 70]. Tag counts were downloaded and split by time-series experiment. CpG islands were assigned to peak annotations within 500 bp centred on each CAGE peak. Peaks expressed in ≤75% of samples were removed prior to analysis for each time-series data set separately. Moderated log2-fold changes between time points were calculated and tag counts were modelled using a negative binomial generalised linear model implemented in the Bioconductor package edgeR [71, 72].
RNA-seq processing and time-series analysis
Gene-by-sample count matrices were downloaded from GEO for the relevant accessions [73, 74]. Genes containing on average >5 read counts across all samples were retained for analysis. Moderated log2-fold changes were estimated for each time-point comparison using the Bioconductor package DESeq2 .
CpG island size enrichment testing
For each data set, we compared the proportion of genes, ranked by log2-fold change in expression between time points, that had a smaller CpG island overlapping the defined promoter region in the earliest time-point comparison. We then applied a one-tailed binomial test against a null hypothesis of a 50:50 relationship between rank and CpG island size. This process is analogous to gene set enrichment testing, using the ranked test statistics from a differential expression test, or equivalent to a paired sample sign test (Additional file 2: Figure S12).
The code used to perform the analyses and the process data are available from https://github.com/MarioniLab/CpGisland2017.
We wish to thank Wolf Reik, Ferdinand von Meyenn and Aaron T. Lun for kindly donating single-cell gene expression data from human ESCs. The authors also wish to thank Sarah Thorpe and Christina Ernst for critical reading of the manuscript.
MDM was supported by the Wellcome Trust (grant 105045/Z/14/Z). JCM was supported by core funding from the European Molecular Biology Laboratory and from Cancer Research UK (award number 17197).
Availability of data and materials
Mus musculus naive Cd4+ T cells: Stubbington MJT, et al. T cell fate and clonality inference from single-cell transcriptomes. Array Express. E-MTAB-3857.
Mus musculus embryonic stem cells: Grun D, et al. Validation of noise models for single-cell transcriptomics. GEO. GSE54695.
Mus musculus embryonic stem cells: Kolodziejczyk AA, et al. Single-cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Array Express. E-MTAB-2600.
Homo sapiens embryonic stem cells: von Meyenn F, Lun AT, Marioni JC, Reik W. Array Express. E-MTAB-6819.
Homo sapiens pancreatic islet cells: Lawlor N, et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type–specific expression changes in type 2 diabetes. GEO. GSE86473.
Mus musculus bone marrow-derived dendritic cells: Shalek AK, et al. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. GEO. GSE48968.
Homo sapiens MCF-7 breast adenocarcinoma cells: Baran-Gale J, et al. An integrative transcriptomics approach identifies miR-503 as a candidate master regulator of the oestrogen response in MCF-7 breast cancer cells. GEO. GSE78169.
Homo sapiens monocyte-derived dendritic cells: Diehl WE, et al. Ebola virus glycoprotein with increased infectivity dominated the 2013–2016 epidemic. GEO. GSE84865.
Homo sapiens CAGE-seq data sets: Forrest ARR, et al. FANTOM5. http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/.
MDM conceived the study, performed all analyses and wrote the manuscript. JCM conceived and supervised the study and wrote the manuscript. Both authors read and approved the final manuscript.
Ethics approval and consent to participate
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Ozbudak EM, Thattai M, Kurtser I, Grossman AD, van Oudenaarden A. Regulation of noise in the expression of a single gene. Nat Genet. 2002; 31(1):69–73.View ArticlePubMedGoogle Scholar
- Elowitz MB. Stochastic gene expression in a single cell. Science. 2002; 297(5584):1183–6.View ArticlePubMedGoogle Scholar
- Thattai M, van Oudenaarden A. Intrinsic noise in gene regulatory networks. Proc Natl Acad Sci. 2001; 98(15):8614–19.View ArticlePubMedPubMed CentralGoogle Scholar
- Raj A, Peskin CS, Tranchina D, Vargas DY, Tyagi S. Stochastic mRNA synthesis in mammalian cells. PLoS Biol. 2006; 4(10):e309.View ArticlePubMedPubMed CentralGoogle Scholar
- Brown CR, Boeger H. Nucleosomal promoter variation generates gene expression noise. Proc Natl Acad Sci. 2014; 111(50):17893–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Ravarani CNJ, Chalancon G, Breker M, de Groot NS, Babu MM. Affinity and competition for TBP are molecular determinants of gene expression noise. Nat Commun. 2016; 7:10417.View ArticlePubMedPubMed CentralGoogle Scholar
- Alemu EY, Carl JW, Corrada Bravo H, Hannenhalli S. Determinants of expression variability. Nucleic Acids Res. 2014; 42(6):3503–14.View ArticlePubMedPubMed CentralGoogle Scholar
- Kar G, Kim JK, Kolodziejczyk AA, Natarajan KN, Torlai Triglia E, Mifsud B, et al. Flipping between polycomb repressed and active transcriptional states introduces noise in gene expression. Nat Commun. 2017;8(36).Google Scholar
- Wu S, Li K, Li Y, Zhao T, Li T, Yang YF, et al. Independent regulation of gene expression level and noise by histone modifications. PLoS Comput Biol. 2017; 13(6):e1005585.View ArticlePubMedPubMed CentralGoogle Scholar
- Faure AJ, Schmiedel JM, Lehner B. Systematic analysis of the determinants of gene expression noise in embryonic stem cells. Cell Syst. 2017; 5(5):471–84.View ArticlePubMedGoogle Scholar
- Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, et al.A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell. 2006; 125(2):315–26.View ArticlePubMedGoogle Scholar
- Brennecke P, Anders S, Kim JK, Kołodziejczyk AA, Zhang X, Proserpio V, et al.Accounting for technical noise in single-cell RNA-seq experiments. Nat Methods. 2013; 10(11):1093–5.View ArticlePubMedGoogle Scholar
- Kim JK, Kolodziejczyk AA, Illicic T, Teichmann SA, Marioni JC. Characterizing noise structure in single-cell RNA-seq distinguishes genuine from technical stochastic allelic expression. Nat Commun. 2015; 6:8687.View ArticlePubMedPubMed CentralGoogle Scholar
- Stubbington MJT, Lönnberg T, Proserpio V, Clare S, Speak AO, Dougan G, et al.T cell fate and clonality inference from single-cell transcriptomes. Nat Methods. 2016; 13(4):329–32.View ArticlePubMedPubMed CentralGoogle Scholar
- Newman JRS, Ghaemmaghami S, Ihmels J, Breslow DK, Noble M, DeRisi JL, et al.Single-cell proteomic analysis of S. cerevisiae reveals the architecture of biological noise. Nature. 2006; 441(7095):840–6.View ArticlePubMedGoogle Scholar
- Dreos R, Ambrosini G, Cavin Périer R, Bucher P. EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era. Nucleic Acids Res. 2013; 41(D1):D157—64.View ArticlePubMedGoogle Scholar
- Picelli S, Faridani OR, Björklund ȦK, Winberg G, Sagasser S, Sandberg R. Full-length RNA-seq from single cells using Smart-seq2. Nat Protoc. 2014; 9(1):171–81.View ArticlePubMedGoogle Scholar
- Picelli S, Bjorklund ȦK, Faridani OR, Sagasser S, Winberg G, Sandberg R. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat Methods. 2013; 10(11):1096–8.View ArticlePubMedGoogle Scholar
- Grun D, Kester L, van Oudenaarden A. Validation of noise models for single-cell transcriptomics. Nat Methods. 2014; 11(6):637–40.View ArticlePubMedGoogle Scholar
- Elango N, Yi SV. Functional relevance of CpG island length for regulation of gene expression. Genetics. 2011; 187(4):1077–83.View ArticlePubMedPubMed CentralGoogle Scholar
- Blackledge NP, Klose R. CpG island chromatin: a platform for gene regulation. Epigenetics. 2011; 6(2):147–52.View ArticlePubMedPubMed CentralGoogle Scholar
- Ford EE, Grimmer MR, Stolzenburg S, Bogdanovic O, de Mendoza A, Farnham PJ, et al.Frequent lack of repressive capacity of promoter DNA methylation identified through genome-wide epigenomic manipulation; 2017. bioRxiv preprint.Google Scholar
- Orlando DA, Guenther MG, Frampton GM, Young RA. CpG island structure and trithorax/polycomb chromatin domains in human cells. Genomics. 2012; 100(5):320–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Blackledge NP, Thomson JP, Skene PJ. CpG island chromatin is shaped by recruitment of ZF-CxxC proteins. Cold Spring Harb Perspect Biol. 2013; 5(11):aX018648.View ArticleGoogle Scholar
- Robinson PN. Gene-ontology analysis reveals association of tissue-specific 5′ CpG-island genes with development and embryogenesis. Hum Mol Genet. 2004; 13(17):1969–78.View ArticlePubMedGoogle Scholar
- Kumar RM, Cahan P, Shalek AK, Satija R, DaleyKeyser AJ, Li H, et al.Deconstructing transcriptional heterogeneity in pluripotent stem cells. Nature. 2014; 516(7529):56–61.View ArticlePubMedPubMed CentralGoogle Scholar
- Chambers I, Silva J, Colby D, Nichols J, Nijmeijer B, Robertson M, et al.Nanog safeguards pluripotency and mediates germline development. Nature. 2007; 450(7173):1230–4.View ArticlePubMedGoogle Scholar
- Levy SF, Ziv N, Siegal ML. Bet hedging in yeast by heterogeneous, age-correlated expression of a stress protectant. PLoS Biol. 2012; 10(5):e1001325.View ArticlePubMedPubMed CentralGoogle Scholar
- Schmiedel JM, Klemm SL, Zheng Y, Sahay A, Bluthgen N, Marks DS, et al.MicroRNA control of protein expression noise. Science. 2015; 348(6230):128–32.View ArticlePubMedGoogle Scholar
- Tullai JW, Schaffer ME, Mullenbrock S, Sholder G, Kasif S, Cooper GM. Immediate-early and delayed primary response genes are distinct in function and genomic architecture. J Biol Chem. 2007; 282(33):23981–95.View ArticlePubMedPubMed CentralGoogle Scholar
- Ramirez-Carrozzi VR, Braas D, Bhatt DM, Cheng CS, Hong C, Doty KR, et al.A unifying model for the selective regulation of inducible transcription by CpG islands and nucleosome remodeling. Cell. 2009; 138(1):114–28.View ArticlePubMedPubMed CentralGoogle Scholar
- Hirschhorn JN, Brown SA, Clark CD, Winston F. Evidence that SNF2/SWI2 and SNF5 activate transcription in yeast by altering chromatin structure. Genes Dev. 1992; 6(12a):2288–98.View ArticlePubMedGoogle Scholar
- Kwon H, Imbalzano AN, Khavari PA, Kingston RE, Green MR. Nucleosome disruption and enhancement of activator binding by a human SW1/SNF complex. Nature. 1994; 370(6489):477–81.View ArticlePubMedGoogle Scholar
- Belandia B. Targeting of SWI/SNF chromatin remodelling complexes to estrogen-responsive genes. EMBO J. 2002; 21(15):4094–103.View ArticlePubMedPubMed CentralGoogle Scholar
- John S, Sabo PJ, Johnson TA, Sung MH, Biddie SC, Lightman SL, et al.Interaction of the glucocorticoid receptor with the chromatin landscape. Mol Cell. 2008; 29(5):611–24.View ArticlePubMedGoogle Scholar
- Johnson TA, Elbi C, Parekh BS, Hager GL, John S. Chromatin remodeling complexes interact dynamically with a glucocorticoid receptor-regulated promoter. Mol Biol Cell. 2008; 19(8):3308–22.View ArticlePubMedPubMed CentralGoogle Scholar
- Jeong KW, Lee YH, Stallcup MR. Recruitment of the SWI/SNF chromatin remodeling complex to steroid hormone-regulated promoters by nuclear receptor coactivator flightless-I. J Biol Chem. 2009; 284(43):29298–309.View ArticlePubMedPubMed CentralGoogle Scholar
- Xi Q, He W, Zhang XHF, Le HV, Massagué J. Genome-wide impact of the BRG1 SWI/SNF chromatin remodeler on the transforming growth factor β transcriptional program. J Biol Chem. 2008; 283(2):1146–55.View ArticlePubMedGoogle Scholar
- Ross S, Cheung E, Petrakis TG, Howell M, Kraus WL, Hill CS. Smads orchestrate specific histone modifications and chromatin remodeling to activate transcription. EMBO J. 2006; 25(19):4490–502.View ArticlePubMedPubMed CentralGoogle Scholar
- Shalek AK, Satija R, Shuga J, Trombetta JJ, Gennert D, Lu D, et al.Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature. 2014; 510:363–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Ren G, Jin W, Cui K, Rodrigez J, Hu G, Zhang Z, et al. CTCF-mediated enhancer–promoter interaction is a critical regulator of cell-to-cell variation of gene expression. Mol Cell. 2017; 67(6):1049–1058.e6.View ArticlePubMedPubMed CentralGoogle Scholar
- Antolović V, Miermont A, Corrigan AM, Chubb JR. Generation of single-cell transcript variability by repression. Curr Biol. 2017; 27(12):1811–1817.e3.View ArticlePubMedPubMed CentralGoogle Scholar
- Laitinen J, Hölttä E. Methylation status and chromatin structure of an early response gene (ornithine decarboxylase) in resting and stimulated NIH-3T3 fibroblasts. J Cell Biochem. 1994; 55(2):155–67.View ArticlePubMedGoogle Scholar
- Drobic B, Pérez-Cadahía B, Yu J, Kung SKP, Davie JR. Promoter chromatin remodeling of immediate-early genes is mediated through H3 phosphorylation at either serine 28 or 10 by the MSK1 multi-protein complex. Nucleic Acids Res. 2010; 38(10):3196–208.View ArticlePubMedPubMed CentralGoogle Scholar
- Byun JS, Wong MM, Cui W, Idelman G, Li Q, De Siervi A, et al.Dynamic bookmarking of primary response genes by p300 and RNA polymerase II complexes. Proc Natl Acad Sci. 2009; 106(46):19286–91.View ArticlePubMedPubMed CentralGoogle Scholar
- Hargreaves DC, Horng T, Medzhitov R. Control of inducible gene expression by signal-dependent transcriptional elongation. Cell. 2009; 138(1):129–45.View ArticlePubMedPubMed CentralGoogle Scholar
- Schor IE, Degner JF, Harnett D, Cannavò E, Casale FP, Shim H, et al.Promoter shape varies across populations and affects promoter evolution and expression noise. Nat Genet. 2017; 49(4):550–58.View ArticlePubMedGoogle Scholar
- Schuettengruber B, Bourbon HM, Di Croce L, Cavalli G. Genome regulation by polycomb and trithorax: 70 years and counting. Cell. 2017; 171(1):34–57.View ArticlePubMedGoogle Scholar
- Berry S, Dean C, Howard M. Slow chromatin dynamics allow polycomb target genes to filter fluctuations in transcription factor activity. Cell Syst. 2017; 4(4):445–457.e8.View ArticlePubMedPubMed CentralGoogle Scholar
- Klose RJ, Cooper S, Farcas AM, Blackledge NP, Brockdorff N. Chromatin sampling—an emerging perspective on targeting polycomb repressor proteins. PLoS Genet. 2013; 9(8):e1003717.View ArticlePubMedPubMed CentralGoogle Scholar
- Schones DE, Cui K, Cuddapah S, Roh TY, Barski A, Wang Z, et al.Dynamic regulation of nucleosome positioning in the human genome. Cell. 2008; 132(5):887–98.View ArticlePubMedGoogle Scholar
- Bar-Even A, Paulsson J, Maheshri N, Carmi M, O’Shea E, Pilpel Y, et al.Noise in protein expression scales with natural protein abundance. Nat Genet. 2006; 38(6):636–43.View ArticlePubMedGoogle Scholar
- Tirosh I, Wong KH, Barkai N, Struhl K. Extensive divergence of yeast stress responses through transitions between induced and constitutive activation. Proc Natl Acad Sci. 2011; 108(40):16693–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Landry CR, Lemos B, Rifkin SA, Dickinson WJ, Hartl DL. Genetic properties influencing the evolvability of gene expression. Science. 2007; 317(5834):118–21.View ArticlePubMedGoogle Scholar
- Bodi Z, Farkas Z, Nevozhay D, Kalapis D, Lazar V, Csorgo B, et al. Phenotypic heterogeneity promotes adaptive evolution. PLoS Biol. 2017; 15(5):e2000644.View ArticlePubMedPubMed CentralGoogle Scholar
- Kolodziejczyk AA, Kim JK, Tsang JCH, Ilicic T, Henriksson J, Natarajan KN, et al.Single cell RNA-sequencing of pluripotent states unlocks modular transcriptional variation. Cell Stem Cell. 2015; 17(4):471–85.View ArticlePubMedPubMed CentralGoogle Scholar
- Lawlor N, George J, Bolisetty M, Kursawe R, Sun L, Sivakamasundari V, et al.Single-cell transcriptomes identify human islet cell signatures and reveal cell-type–specific expression changes in type 2 diabetes. Genome Res. 2017; 27(2):208–22.View ArticlePubMedPubMed CentralGoogle Scholar
- L Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016; 17(1):75.View ArticlePubMed CentralGoogle Scholar
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al.The human genome browser at UCSC. Genome Res. 2002; 12(6):996–1006.View ArticlePubMedPubMed CentralGoogle Scholar
- Karolchik D. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004; 32(90001):493D–496.View ArticleGoogle Scholar
- Sims D, Ilott NE, Sansom SN, Sudbery IM, Johnson JS, Fawcett KA, et al.CGAT: computational genomics analysis toolkit. Bioinformatics. 2014; 30(9):1290–1.View ArticlePubMedPubMed CentralGoogle Scholar
- Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011; 27(7):1017–18.View ArticlePubMedPubMed CentralGoogle Scholar
- Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, et al.JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 2017; 46(D1):D260—6.Google Scholar
- Xiao S, Xie D, Cao X, Yu P, Xing X, Chen CC, et al.Comparative epigenomic annotation of regulatory DNA. Cell. 2012; 149(6):1381–92.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009; 25(14):1754–60.View ArticlePubMedPubMed CentralGoogle Scholar
- Sene K, Porter CJ, Palidwor G, Perez-Iratxeta C, Muro EM, Campbell PA, et al.Gene function in early mouse embryonic stem cell differentiation. BMC Genomics. 2007; 8(1):85.View ArticlePubMed CentralGoogle Scholar
- Sandie R, Palidwor GA, Huska MR, Porter CJ, Krzyzanowski PM, Muro EM, et al.Recent developments in StemBase: a tool to study gene expression in human and murine stem cells. BMC Res Notes. 2009; 2(1):39.View ArticlePubMedPubMed CentralGoogle Scholar
- Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al.limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43(7):e47.View ArticlePubMedPubMed CentralGoogle Scholar
- Forrest ARR, Kawaji H, Rehli M, Kenneth Baillie J, de Hoon MJL, Haberle V, et al.A promoter-level mammalian expression atlas. Nature. 2014; 507(7493):462–70.View ArticlePubMedGoogle Scholar
- Lizio M, Harshbarger J, Shimoji H, Severin J, Kasukawa T, Sahin S, et al.Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol. 2015; 16(1):22.View ArticlePubMedPubMed CentralGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1):139–40.View ArticlePubMedGoogle Scholar
- McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic Acids Res. 2012; 40(10):4288–97.View ArticlePubMedPubMed CentralGoogle Scholar
- Baran-Gale J, Purvis JE, Sethupathy P. An integrative transcriptomics approach identifies miR-503 as a candidate master regulator of the estrogen response in MCF-7 breast cancer cells. RNA. 2016; 22(10):1592–603.View ArticlePubMedPubMed CentralGoogle Scholar
- Diehl WE, Lin AE, Grubaugh ND, Carvalho LM, Kim K, Kyawe PP, et al. Ebola virus glycoprotein with increased infectivity dominated the 2013–2016 epidemic. Cell. 2016; 167(4):1088–1098.e6.View ArticlePubMedPubMed CentralGoogle Scholar
- Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15(12):550.View ArticlePubMedPubMed CentralGoogle Scholar