jMOSAiCS: joint analysis of multiple ChIP-seq datasets

The ChIP-seq technique enables genome-wide mapping of in vivo protein-DNA interactions and chromatin states. Current analytical approaches for ChIP-seq analysis are largely geared towards single-sample investigations, and have limited applicability in comparative settings that aim to identify combinatorial patterns of enrichment across multiple datasets. We describe a novel probabilistic method, jMOSAiCS, for jointly analyzing multiple ChIP-seq datasets. We demonstrate its usefulness with a wide range of data-driven computational experiments and with a case study of histone modifications on GATA1-occupied segments during erythroid differentiation. jMOSAiCS is open source software and can be downloaded from Bioconductor [1].


Background
The advent of high-throughput next generation sequencing (NGS) technologies has revolutionized the fields of genetics and genomics by allowing rapid and inexpensive sequencing of billions of bases. Among the NGS applications, ChIP-seq (chromatin immunoprecipitation followed by NGS) is perhaps the most successful to date. Initial ChIP-seq studies largely focused on single-sample investigations. However, as we begin to understand the role of epigenomics in biological variation, detailed comparisons of transcription factor (TF) binding and epigenomic marks between different tissues and individuals at single or multiple time points or developmental stages are becoming essential to understand the etiology and progression of many diseases. Therefore, comparative analysis of multiple ChIP-seq samples to identify combinatorial TF binding or epigenome profiles are rapidly emerging. Some examples include: (i) identifying differential binding of a TF or modification of a histone mark across multiple individuals, for example, [2] studied variation in binding of NF-B and RNA polymerase II (Pol II) across ten individuals; [3] performed a genetic analysis of Ste12 binding in yeast by studying differential binding across 43 segregants of a cross between two yeast strains; (ii) genome-wide binding profiles of multiple TFs in a single tissue or cell line, for example, comparative analysis of 22 Caenorhabditis elegans TFs [4]; (iii) time course or multiple developmental stage ChIP-seq experiments, for example, Pol II binding at six developmental stages of C. elegans [4]; and (iv) comparative analysis of binding profiles of one or more TFs with Pol II or modifications of histone marks, for example, [5,6].
Although there are already more than 30 algorithms and methods for ChIP-seq analysis (reviewed in [7]), all of them are limited to single-sample analysis and lack the ability to simultaneously compare multiple ChIP samples. The small number of available multi-sample ChIP-seq analysis tools are either specific to ChIP-seq design (for example, [8] is specifically for identifying chromatin states from ChIP-seq of histone modifications; [9] focuses on gene-centric analysis), exploratory [10] or difficult to generalize to more than two samples [11][12][13] due to computational reasons. This presents challenges for biological interpretation since combining results from individual analysis of multiple experiments can be a daunting task, especially for systematically enumerating combinatorial patterns of enrichment, controlling the overall false discovery rate (FDR), and prioritizing candidate regions for further experimental validation. A more recent genome segmentation algorithm, Segway [14], designed for multiple ChIP-seq datasets, utilizes a dynamic Bayesian network method and offers flexibility by enabling analysis at 1 bp resolution. However, the current Segway implementation requires specialized cluster management systems and is not readily available for running on standard desktops or computing clusters without a cluster management system. We introduce jMOSAiCS (joint model-based one-and two-sample analysis and inference for ChIP-seq), which is a probabilistic model for integrating multiple ChIP-seq datasets to identify combinatorial patterns of enrichment.
The key components of jMOSAiCS are base models for the sequencing reads of each individual ChIP-seq experiment and a model that governs the relationship of enrichment among different samples. We chose well-developed models from the ChIP literature for both of these key components. We evaluate jMOSAiCS with extensive datadriven computational experiments and compare it to both a separate analysis approach of multiple datasets and chromHMM [8]. We show that jMOSAiCS, which is applicable to both TF and histone ChIP-seq data, has better power and provides better false discovery rate control than the separate approach. We present an application of jMOSAiCS to multiple histone modifications during erythroid differentiation [6]. This analysis identified a cluster of GATA1-occupied loci exhibiting a pattern of enrichment that is different than that identified by chromHMM analysis of the same datasets. We support our computational predictions by experimental validation of the predicted patterns of histone modifications for a number of selected loci. These results indicate that jMOSAiCS can reveal both global and local combinatorial enrichment patterns with high sensitivity.

Model description
The most commonly used NGS platform for ChIP-seq is the Illumina platform [5,[15][16][17], which works by sequencing 25 to 100 bp from one or both ends of each DNA fragment in the sample of interest and generates millions of short reads. Standard pre-processing of reads involves mapping to a reference genome and summarizing total counts in each small non-overlapping interval (referred to as bins). Statistical analysis to detect enriched regions, that is, peaks, in a single ChIP-seq sample is based on these counts and is carried out as a one-or two-sample analysis depending on the availability of a control sample. In contrast, inference from multiple samples involves classifying regions of genome into patterns of enrichment. For D samples, we can observe up to 2 D different enrichment patterns across genomic regions. For example, for D = 2, {00, 01, 10, 11} is the set of possible patterns: 00 means not enriched in either of the samples; 10 enriched only in sample 1; 01 enriched only in sample 2 and 11 enriched in both samples.
We consider I genomic regions of possibly different lengths across a reference genome. These initial set of I regions can be obtained by analyzing each dataset separately with one of the many available ChIP-seq analysis methods [7] and identifying regions of enrichment at a liberal FDR level. Let unobserved random variable E id ∈ {0, 1} denote enrichment for region i in dataset d. The overall enrichment pattern E i is defined as the vector (E i1 , ..., E iD ). Our joint model has three layers, which are depicted in Figure 1. The first layer, called the E layer, concerns joint modeling of E id for inferring combinatorial enrichment. This is enabled by defining a region-level random variable B i as described below. The second layer, called the Y layer, concerns observed read count data for region i across D samples: and L i denotes the number of bins in region i. In the case of a two-sample problem, Y idj is vector-valued and denotes both the ChIP and control counts for the jth bin of the ith region in the dth sample. We assume that the counts from different samples are independent conditional on the enrichment pattern: where r = 1, ..., R represents possible enrichment patterns. Note that Pr(Y id | E i = r) = Pr(Y id | E id = r d ), r d = 0, 1, and only concerns data for the L i bins from the dth sample. E id = 0 implies that all the bins in region i are from the background (unenriched) component in the dth sample. In contrast, if E id = 1, one or more bins show enrichment. The third layer, called the Z layer, concerns Z idj , which we define as the bin-specific enrichment variable. If the jth bin in the ith region is enriched in dataset d, then Z idj = 1 and is 0 otherwise. We assume that Z idj , j = 1, . . . , L i , ∀d, i are independent and conditional on the region-specific enrichment indicator E id and hence: The key to our joint modeling approach are the models we utilize for the E and Y layers. For the E layer, we adopt the joint ChIP-chip model of JAMIE [18], which facilitates information sharing across experiments by capturing the correlation among datasets. In this model, the broad dependencies among the D samples are captured via unobserved variable B, where B i ∈ {0, 1} denotes whether region i is potentially enriched and E id is defined to be 1 if region i is enriched in sample d. We assume that E i1 , ..., E iD are conditionally independent given B i . Let Pr(B i = 1) = τ 1 , Pr(E id = 1 | B i = 1) = η d , and Pr(E id = 1 | B i = 0) = 0, that is, the region cannot be enriched in any dataset if B i = 0. Then, we have: The joint probability of (E i1 , ..., E iD ) is given by: For the Y layer, we adopt the model-based approach of MOSAiCS [19] since MOSAiCS provides parametric models for read counts from both the enriched and unenriched regions in both the one-(without a control sample) and two-sample (with a control sample) problems. At the bin level, Y idj | Z idj = 0~N idj , where N idjÑ egBin (a, a/µ idj ) are the background read counts. Its mean µ idj is parameterized as log μ idj = β 0 + β 1 X c idj , where X idj are the bin-level read counts in the control sample and c is a transformation parameter set dataadaptively. For one-sample analysis without a control sample or for two-sample analysis with a shallow sequenced control sample, MOSAiCS provides a parameterization of the bin-level counts that also depends on mappability and guanine-cytosine (GC) content. For the enriched bins, Y idj | Z idj = 1~N idj + S idj , where S idj is the signal due to enrichment, that is, protein binding or epigenomic marker modification. The signal S idj is modeled either as a single negative binomial distribution or a mixture of two negative binomial distributions. This choice is based on model fit and is determined through Bayesian information criterion (BIC) [20] by MOSAiCS. For model fitting, we utilize the fact that MOSAiCS provides fast and accurate estimates of the dataset-specific background and signal distributions. Therefore, as a part of model fitting, jMOSAiCS only needs to infer parameters associated with the B and E variables, namely τ 1 and η d , d = 1, ..., D. In addition, jMOSAiCS provides posterior probabilities of the B and E variables that facilitate identification of region-specific enrichment patterns across the D datasets. We implemented jMOSAiCS as an R package and it is available from Bioconductor [1]. Additional Files 1 and 2 provide a freeze of the R package and its vignette. These are included in the manuscript for archival i-th region of the genome Figure 1 Pictorial depiction of the jMOSAiCS model for a region across two ChIP-seq datasets. Region i consists of three bins. The B variable governs whether or not the region is enriched in any of the two samples. The E variables denote sample-specific enrichments and are conditionally independent given the B variable. The Z variables depict enrichment at the bin level and are conditionally independent given the sample-specific E variables. When E id = 1, one or more consecutive Z variables are set to 1 to capture enrichment. The observed read count Y can be scalar or vector-valued depending on the availability of a control input sample. Data fits at the Y layer are obtained by MOSAiCS [19] on individual samples and evaluated by the goodness-of-fit (GOF) plots. ChIP: chromatin immunoprecipitation; jMOSAiCS: joint model-based oneand two-sample analysis and inference for ChIP-seq; MOSAiCS: model-based analysis and inference for ChIP-seq data purposes only. We recommend that users download the most recent version of the software from Bioconductor [1].

Data-driven computational experiments
We evaluated jMOSAiCS with data-driven computational experiments by simulating multiple ChIP-seq datasets based on model fits on actual datasets. We utilized ChIP-Seq experiments of STAT1 binding in interferon-γ-stimulated HeLa S3 cells by [21], H3K9me3 (repression mark) modification in peripheral blood mononuclear cells (PBMCs) from two unrelated individuals (Bresnick Lab, UW Madison), and methyl CpG binding protein (MeCP2) in mouse cortex (Chang Lab, UW Madison). The model fits were obtained by MOSAiCS and the goodness-of-fit plots indicated satisfactory fits as discussed in [19]. We simulated multiple ChIP-seq datasets using parameters that matched observed values in the STAT1, H3K9me3 and MeCP2 ChIP-seq experiments. The density plots of the read counts from the actual and sample simulated data are provided in Additional file 3, Figure S1, and indicate that the simulated data mimics the actual data well. In what follows, we first compared jMOSAiCS with a commonly practiced separate analysis scheme where each ChIP-seq dataset is analyzed individually and the enrichment patterns are generated by post hoc analysis. Then, we compared jMOSAiCS to chromHMM [8], which is currently the state-of-the-art approach for discovering combinatorial patterns of chromatin states from multiple ChIP-seq data.
jMOSAiCS improves on a separate analysis of multiple ChIP-seq datasets Comparisons based on data-driven STAT1 experiments: analysis of multiple ChIP-seq datasets of two or more TFs under similar biological conditions Data for this experiment uses the actual input experiment as the control sample and emulates ChIP-seq of multiple transcription factors in a single biological condition. Since we repeated each simulation experiment multiple times to assess variability, we restricted our data generation process to chromosome 12 of the human genome to reduce computational time. We considered two settings with D = 2 and D = 3 datasets. The actual parameter values for each setting are summarized in Additional file 3, Table S1. For both settings, jMOSAiCS and the separate analysis approach, which identified enrichment for each individual dataset separately by MOSAiCS, are employed. Typical output from a ChIP-seq analysis is a ranked list of enriched regions. The length of the list can be based on a FDR cutoff, other types of type-I error rate control or the investigators may choose to consider a certain number of high ranking regions. We evaluated the joint and the separate analysis approaches by taking this variation in reporting of the results into consideration. Specifically, we considered: (i) accuracy by plotting the proportion of correctly detected enriched regions obtained by the B variable and also correctly detected enrichments obtained by dataset-specific E variables as a function of top ranking enrichment regions; (ii) sensitivity by plotting the proportion of the true set of enrichments that are detected as a function of the nominal false discovery rate (the total number of detected true enrichments identified at different FDR cutoffs divided by the total number of true enrichments are reported); (iii) false discovery rate control by plotting the observed FDR as a function of target nominal FDR. Ranking of regions and FDR control for jMOSAiCS relied on the posterior inference with the B variable, which captures whether or not any given region is enriched in any of the datasets, and the E variables, which infer whether or not the regions are enriched in specific datasets. We generated similar variables for the separate analysis in a post hoc fashion after individual samples were analyzed with MOSAiCS. Figure 2 summarizes these results for the D = 2 setting across 20 simulation runs (results for D = 3 are available in Additional file 3, Figure S2). This setting, on average, has 85,000 enriched regions, that is, regions with B = 1. Figure 2(a), which displays the proportion of top ranking enriched regions that are true positives, indicates that jMOSAiCS and the separate analysis exhibit similar accuracy for the top 36% of the enriched regions; however, jMOSAiCS outperforms the separate approach significantly as we go down the list of top ranking regions. The differences in performances are significant both at the region level (B level, based on the B variable) in detecting whether or not there is any enrichment in a region in any of the D datasets and also at the individual dataset level (E 1 and E 2 levels, based on the E variables). Beyond the 68% of the top enrichment regions (≥58,000), the improvement in accuracy due to the joint analysis is about 10% at the individual dataset level. In addition, jMOSAiCS exhibits much smaller variation in accuracy compared to the separate analysis as the number of top ranking regions considered increases. Since this setting had similar signal strengths for both datasets, dataset-specific accuracy improvements over the separate analysis captured by the E 1 and E 2 variables are similar. Figure 2(b) evaluates the two approaches in terms of sensitivity and illustrates that jMOSAiCS has better sensitivity than the separate approach at every nominal FDR level. Overall, jMOSAiCS identifies a larger number of enriched regions and captures a significantly higher proportion of the true set of enrichments compared to the separate approach at the same FDR level. When FDR is 0.01, the improvement in sensitivity is 9% at the B level and more than 15% at the E level. At the same FDR   cutoff, jMOSAiCS identifies more true enrichments than the separate analysis. Next, we show how well the FDR is controlled by the two approaches in Figure 2(c), which depicts observed FDR across 20 simulations for different levels of nominal FDR. Overall, we observe that jMO-SAiCS provides better FDR control than the separate approach and its FDR estimates at the E level are more accurate. For the B level, we observe some overestimation of FDR by jMOSAiCS; however, this still presents significant improvement over the separate analysis. Overall conclusions based on the H3K9me3 simulations, which emulate data for a single epigenetic mark in two different conditions (two different individuals), agree with those of STAT1 results and the detailed results are provided in Additional file 3, Figure S3.
Comparisons based on data-driven MeCP2 experiments: joint analysis of replicate ChIP-seq experiments ChIP-seq experiments are often carried out with at least two biological replicates to allow an assessment of variability. Prior research suggests that non-specific biases such as GC content can vary significantly between biological replicates [19,22]. As a result, it is not often clear whether or not data can be pooled at the biological replicate level for the purpose of identifying enrichment. We studied a joint analysis strategy of multiple replicates with jMOSAiCS for a computational experiment based on MeCP2 binding in mice. The data consisted of two biological replicates with five and six lanes of sequencing reads, respectively. The number of usable reads within a lane varied between 6.8 and 19.7 million. MOSAiCS provided adequate fits on each dataset and the simulation parameters were set according to estimates from the MOSAiCS fits. Details on the parameter settings are available in Additional file 3, Table S1. Within this simulation, we varied the sequencing depth of one of the replicates (replicate 2) at one, three, and six lanes while keeping the other replicate at five lanes. One-and threelane scenarios emulate the cases where one of the replicates has much lower sequencing depth than the other. This setting can arise in a variety of contexts, for example, when multiple samples are multiplexed together in one lane or when replicates are generated at different times. Figures 3, 4, and 5 summarize the results for these experiments. Figure 3(a) illustrates that, for lower depth scenarios of replicate 2, jMOSAiCS has significantly higher accuracy than the separate analysis at the B level when inferring whether the regions are enriched in any of the replicate datasets. E-level comparisons of accuracy for replicate 2 ( Figure 5(a)) reveal a consistent 15% difference in accuracy between jMOSAiCS and the separate approach. When both replicates have high sequencing depths, jMO-SAiCS provides a small but significant improvement over the separate analysis (jMOSAiCS (5-6) vs. Separate (5)(6) across Figures 3(a), 4(a), and 5(a)). The differences in the sensitivities of the two approaches vary significantly with the number of lanes of replicate 2 (Figures 3(b), 4(b), and 5(b)). Overall, jMOSAiCS consistently identifies 10% to 15% more of the true enrichments when replicate 2 has lower depth. In Figure 4, as expected, the sensitivity of enrichment detection in replicate 1 is not affected by the number of lanes of replicate 2 in the separate analysis. However, jMOSAiCS also improves on this replicate as the number of lanes for the other replicate increases by sharing information across the two replicates through the B variable. The largest improvement due to jMOSAiCS is in the detection of enriched regions in the low depth replicate when it has only one lane of data ( Figure 5(b)). In this setting, jMOSAiCS identifies 50% more of the true enrichment regions across all the nominal FDR levels. In Figures 3(c), 4(c), and 5(c), we observe that jMOSAiCS generally has more variable but accurate FDR estimation for both the B and E levels. When replicate 1 has five lanes and replicate 2 only one lane, FDR controls by jMOSAiCS for the B and E 2 levels are less accurate; however, the overall accuracy of jMOSAiCS is significantly better when a fixed number of top ranking regions are considered (Figures 3(a) and 5(a)).
We also carried out a variation of this experimental setting by lowering the sequencing depths of both of the replicates to one and three lanes. The results are reported in Additional file 3, Figures S4, S5, and S6, and agree well with the overall conclusions reported here.
Comparison with chromHMM chromHMM [8] is a hidden Markov model-based approach for partitioning a reference genome into multiple chromatin states based on multiple histone modification ChIP-seq datasets. The software accepts as input either aligned read files or enrichment/peak calls for each dataset. When provided with the aligned reads, it partitions the genome into 200 bp intervals and assigns each interval a 1 or 0 based on a local Poisson background distribution to depict enrichment. chromHMM aims to identify global patterns of enrichment and hence it approximates the space of two-dimensional enrichment patterns with a much smaller number as it is computationally prohibitive to consider the full state space with this model. As output, it reports the specific combination of epigenomic marks (enrichment patterns) associated with each chromatin state and the frequencies between 0 and 1 with which they occur. We compared jMOSAiCS and chromHMM in three settings using the data-driven experiments of STAT1 ChIP-seq data in HeLa cells. Although these initial parameters are derived from TF ChIP-seq data, they are able to generate ChIP-seq data with marginal density similar to those c d  SE1: Same as the STAT1 simulation described in the earlier section. SE2: η 2 lowered from 0.9 to 0.5 to increase the number of regions with ten patterns. SE3: Strengthened the ChIP signal by substituting b 1 and b 2 with 2 × b 1 and 2 × b 2 .
One of the major differences between chromHMM and jMOSAiCS is that chromHMM models binary enrichment indicators as the observable data whereas jMOSAiCS models the actual read counts (Y layer). In addition, jMOSAiCS can capture all possible enrichment patterns even for a large number of datasets (D) because the joint distribution of the enrichment variables is governed by the univariate B variable. To investigate the effect of the binarization in chromHMM, we considered three versions of chromHMM: (i) original chromHMM; (ii) chromHMM coupled with true binarization; (iii) chromHMM where bin-level binarization is based on peak calling with MOSAiCS at nominal FDR levels of 0.05 and 0.2. Detailed results for setting SE2 are provided in Figure 6.   misannotated by chromHMM. Overall, these results illustrate that jMOSAiCS outperforms four-state chromHMM in this setting. When coupled with true binary data, chromHMM annotated all chromatin states accurately. Using peaks called by MOSAiCS increased the accuracy compared to the original four-state chromHMM but identified fewer correct regions in the 11 state. Figure 6(b) provides detailed comparison of jMOSAiCS with the two-state chromHMM where chromHMM approximates the full state space of dimension 4 by only two states. A similar comparison between jMOSAiCS and four-state chromHMM is provided in Additional file 3, Figure S7. We observe that approximating the state space of dimension of 4 by two dimensions leads to significant loss in accuracy for chromHMM. At the individual dataset level, the difference in accuracy between two-state chromHMM and jMOSAiCS can be as large as 20% (comparing jMOSAiCS-E2 with chromHMM-E2 in Figure 6(b)). The results for simulation settings SE1 and SE3 are similar and provided as Additional file 3, Figures S8 and S9.

Scalability with large numbers of datasets
We evaluated how well jMOSAiCS scales up to large numbers of datasets by extending our simulation setting SE3 with 5b 1 and 5b 2 . We generated D = 20 datasets and assigned each region of size 250 bp to one of the 76 states out of 2 20 possible states, including the 0...0, non-enrichment state. The number of regions assigned to each individual enrichment state ranged between 1,048 and 1,212 and, on average, each state had 1,129 assigned regions. These experiments revealed that because the dataset-specific background and signal read distributions are estimated separately in the jMOSAiCS framework, it scales up easily to large numbers of datasets. Thus, for D = 20 datasets, jMOSAiCS runs in 2 hours on a 64-bit machine with an Intel Xeon 3.0 GHz processor after the background and signal read distributions are obtained for each dataset. Dataset-specific estimation with MOSAiCS requires 30 minutes to an hour; however, since each dataset can be handled separately, this process can be run in parallel using multiple CPUs. Although jMOSAiCS can generate any number of states for large D, it is important to summarize key states for any given collection of datasets. One key advantage of jMOSAiCS is that the model fitting can be carried about without a priori setting the maximum number of allowed states. After the model is fit, each region is initially assigned to the state for which it has the largest posterior probability. However, if a much smaller number of states is desired, we consider the top K states with the largest number of region assignments and reassign each region to these K states. From the perspective of jMOSAiCS, summarizing the set of possible states with K states is analogous to choosing the number of clusters in a clustering problem. We have implemented a penalized average silhouette based criterion [23], which is widely used in clustering and seems to work well for our large D simulations (Additional file 3, Figure S10).
We next evaluated the accuracy and sensitivity for both jMOSAiCS and chromHMM by varying the maximum number of allowed states, K max , as 30, 76, and 100 (Figures 7(a) and 7(b)). Overall, jMOSAiCS has very good accuracy and sensitivity when the number of states is chosen optimally or overestimated. However, when the number of states is grossly underestimated, the accuracy is comparable to those of large numbers of states for the top set of detected enriched regions (approximately 35%) and then it rapidly deteriorates since the small number of states simply does not capture the state of many regions. In order to evaluate chromHMM under similar conditions, we performed chromHMM analysis by requiring 30, 76, and 100 states and thresholded the resulting emission probabilities to generate distinct states. This resulted in a total of 27, 63, and 85 distinct states.
Application to mouse ENCODE data of multiple histone modifications during erythroid differentiation We applied jMOSAiCS to ChIP-seq data with antibodies specific to the histone modifications H3K4me3, H3K4me1, H3K27me3, and H3K9me3 in G1E and G1E-ER4+E2 cells [6]. These data were generated as part of the mouse ENCODE project and analyzed by chromHMM to segment the mouse erythroid genome based on chromatin modifications in [6]. The original analysis by [6] focused on segmentation of GATA1-occupied segments since G1E cells are a GATA-(null) cell line derived from targeted disruption of GATA1 in embryonic stem cells whereas G1E-ER4 cells are G1E cells engineered to express a conditionally active estrogen receptor (ER) ligand binding domain fusion to GATA1 (ER-GATA1). When estradiol is added to the culture medium (G1E-ER4+E2), the ER-GATA1 fusion protein gets activated and binds to GATA1-specific sites. chromHMM analysis approximated a 2 4 = 16 dimensional state space with only six states. Our jMOSAiCS application explored the full state space and, in addition to the six states identified by chromHMM, identified five more states to which a significant number of GATA1-occupied segments were assigned. Figure 8(a) enumerates the state space for jMOSAiCS and Figure 8(b) lists the number of GATA1-occupied segments for each state in the G1E and G1E-ER4+E2 cells. Overall, we observe that chromHMM captures broad dominating patterns and jMOSAiCS improves resolution for identifying local structures. In Figure 8(c), we provide normalized read data for the 311 GATA1-occupied peaks (with width less than 1,400 bp out of a total of 366) identified to switch from state 1101 in G1E to state 1111 in G1E-ER4+E2. We note that the chromHMM output does not include the 1101 or the 1111 pattern and distributes these loci over the six patterns it utilizes. However, as evidenced from the heatmaps, these GATA1-occupied segments lack the repressive mark H3K27me3 in G1E cells and exhibit the mark upon activation of GATA1 in G1E-ER4+E2.
We annotated these GATA1-occupied segments with respect to gene location and identified that a large subset of them (48%) map to the immediate 5' or 3' end, or within introns of known genes. We studied expression profiling data from GATA1-null erythroid precursor cells that stably express a conditionally active allele of GATA1 fused to the estrogen receptor ligand binding domain (G1E-ER-GATA-1). Differential expression analysis of uninduced and beta-estradiol-induced G1E-ER-GATA-1 cells [24] identified Elf1, Atp6v1e1, Cmas, Ech1, Extl3, Rab4a, Casc3, and Lrrf1p2 as significantly induced upon GATA1 activation with beta-estradiol treatment for 24 hours (FDR adjusted P value = 0.05). Although H3K27me3 is conventionally viewed as inhibitory to transcription, [25] recently identified an enrichment profile of H3K27me3 in the promoter of genes associated with active transcription. The genes we identified constitute further examples of this class. Several of these significantly expressed genes have established functions in stem cell biology and hematopoiesis. For example, Elf1 is an Ets transcription factor involved in the control of hematopoiesis through participating in the transcriptional activation of the Stem Cell Leukemia (SCL)/T-cell Acute Lymphocytic Leukemia-1 (TAL1) gene [26,27]. We performed quantitative ChIP analysis of these four loci and validated the H3K4me1, H3K4me3, H3K27me3, and H3K9me3 marks at these loci in beta-estradiol-induced G1E-ER-GATA-1 cells (Additional file 3, Table S2). We provide detailed read coverage plots of these regions in Additional file 3, Figures S11 to S14 along with their chromHMM annotations to further support their jMOSAiCS annotation.
In addition to the above direct comparison of jMO-SAiCS analysis with the results of [6] for a six-state chromHMM, we repeated the chromHMM analysis by requiring 16 states. We cross-tabulated the numbers of GATA1-occupied segments assigned to each of the 16 states by the two methods in Additional file 3, Figures  S15(a) and (b) for the G1E and G1E-ER4+E2 cell lines, respectively. Overall, 59% of regions are assigned to the same chromatin state by both methods in both cell lines (G1E: 6774/ 11485 and G1E-ER4+E2: 6752/11485). The largest discordances between the two methods are due to the modification of the H4K3me3 mark. Most of the regions that are identified as unenriched for the H4K3me3 in chromHMM are identified as enriched by jMOSAiCS. In order to quantify this further, we marginally tabulated the regions according to their enrichment classification by both chromHMM and jMOSAiCS (Additional file 3, Table S4). Next, we investigated the raw data (both ChIP read counts and log base 2 ratios of ChIP over scaled input read counts) within each of these classes. Additional file 3, Figures S16(a) and (b) display boxplots of log base 2 ratios for 11 (detected as enriched by both jMOSAiCS and chromHMM), 10 (detected as enriched only by jMOSAiCS), 01 (detected as enriched only by chromHMM), and 00 (not detected as enriched by either method) groups within each mark across the two cell lines. For each GATA1-occupied segment, we used the bin with the maximum log base 2 ratio to generate these plots. Overall, we observed that regions declared as enriched by both methods showed the most enrichment of ChIP compared to input and the regions identified as unenriched by both methods showed the least enrichment. Although the differences between 10 (jMOSAiCS specific) and unenriched regions did not appear as pronounced as the differences between 11 (enriched according to both methods) versus 00 (unenriched according to both methods), the observed differences were highly significant (P values < 1 × e -10 with both the t-test and Wilcoxon rank-sum test), indicating that these regions are enriched for H4K3me3. In addition, we observed a relatively small number of regions identified as enriched exclusively by chromHMM. Although some of these appear to be true false negatives for jMOSAiCS, overall, they tend to be regions with relatively low ChIP counts. Additional file 3, Figure S16(c) displays ChIP versus normalized input read counts across all the regions, stratified with respect to their enrichment status by jMOSAiCS and chromHMM, for the H3K4me1 mark in G1E-ER4+E2 cells. We also evaluated how the performance of chromHMM changed for the segments that were identified as changing from 1101 in G1E to 1111 in G1E-ER4 +E2 by jMOSAiCS. Although the 16-state chromHMM produced concordant results with jMOSAiCS for 58.5% of these GATA1-occupied loci, including the validated Atp6v1e and Cmas loci, it still mislabeled both Elf1 and Extl3 (Additional file 3, Table S5).

Discussion
Integrative analysis of multiple ChIP-seq datasets for enumerating enrichment patterns is an emerging need. We introduced jMOSAiCS to enable efficient one-or two-sample integrative analysis of multiple ChIP-seq datasets. jMOSAiCS capitalizes on the dataset-specific accurate model fits by MOSAiCS and efficient encoding of the joint distribution of the enrichment across multiple datasets by the JAMIE approach of [18]. Diagnostics is an important component of probabilistic model-based approaches. jMOSAiCS inherited the goodness-of-fit plots provided by MOSAiCS for model checking and diagnostics. In contrast to some of the few available joint analysis methods for multiple ChIP-seq data (for example, [11]), jMOSAiCS can efficiently handle multiple datasets and is accurate at both obtaining global and local structures. A comparison of jMOSAiCS with chromHMM revealed that jMOSAiCS is better at identifying local structures since it can capture any specific enrichment pattern and does not rely on approximating the number of states with a smaller number of patterns. This observation is further supported by identification of a considerable number of GATA1-occupied segments in a different state than was identified by chromHMM.
Our computational experiments indicated that jMO-SAiCS scales up well with large numbers of datasets and it can summarize key states with a penalized average silhouette criterion [23]. Our analyses illustrated that jMOSAiCS is powerful in analyzing biological replicates simultaneously when it is not appropriate to pool them due to non-specific sequencing biases such as the GC content. When one or more of the replicates is shallowly sequenced compared to others, jMOSAiCS boosts the power for these replicates. Another particularly attractive use for jMOSAiCS is when the TF of interest interacts with the reference genome through another DNA binding protein. For example, virus-host interactions are typically facilitated by virus proteins interacting with the host DNA via host proteins. Joint analysis of ChIP-seq data for host and virus proteins has the potential to boost power for detecting regions enriched for the virus protein (for example, [28]).
Meta-analysis of multiple samples is another integrative approach to multiple ChIP-seq samples. However, the focus of such meta approaches (for example, MM-ChIP [29] and ChIPMeta [30]) is the analysis of ChIP (-chip or -seq) data of the same protein under similar biological conditions but by different platforms or laboratories for the purpose of boosting the power of peak detection. The focus in jMOSAiCS is combinatorial pattern detection across multiple datasets (same TF in different biological conditions or different TFs or epigenomic marks in the same biological conditions). Therefore, our computational experiments focused on comparing jMOSAiCS with chromHMM, which is suitable for the latter task. jMOSAiCS can handle multiple ChIP-seq datasets with varying experimental parameters such library size and read length because the marginal distributions of read counts in each dataset are modeled in a dataset-specific manner.
jMOSAiCS currently implements a naive Bayes model for the joint distribution of the dataset-specific enrichment indicators. This model captures broad dependencies among the samples via an unobserved variable. A potential improvement is to consider how enrichment of a region in a sample depends on its enrichment in other samples. A general way to induce such a structure is by Bayesian networks, where a directed acyclic graph represents the dependencies. Trees, which generalize firstorder Markov chains, and mixtures of trees for which efficient structure learning algorithms exist [31] are two appealing, flexible candidates that can encode for increasingly complex dependencies. Furthermore, they can be tailored for specific characteristics of analyzed samples, for example, a Markov structure for time course ChIPseq experiments.

Conclusion
jMOSAiCS facilitates joint analysis of multiple ChIP-seq datasets for both identifying enrichment patterns of a single TF across multiple conditions and characterizing enrichment patterns of multiple epigenomic marks in one or more conditions. Given model fits from the peak/enrichment caller MOSAiCS, a typical jMOSAiCS run takes about 30 minutes (2 hours) to identify combinatorial patterns of four (twenty) datasets across the whole mouse genome with a single CPU on a 64-bit machine with an Intel Xeon 3.0 GHz processor.

Model fitting and parameter estimation in jMOSAiCS
Let f 0d and f 1d denote read count distributions for unenriched and enriched bins in dataset d. We will denote estimates of these by MOSAiCS withf 0d andf 1d . When a region is not enriched in dataset d, data for all the bins within that region are generated from f 0d . Hence: If region i is enriched in dataset d, then read counts for one or more consecutive bins within region i are generated from f 1d . This enforces local spatial coherence and is motivated by the wide range of enriched region widths observed in ChIP-seq data of histone modifications. Note that this kind of spatial dependence is also capture by the chromHMM model. Let V id denote the number of enriched bins and S id ∈ {1, . . . , L i } the starting position of the set of enriched bins in region i. Then, we have: where we assume that the run of enriched bins can start anywhere within the region with equal probability of 1/L i and the length of the run has a uniform discrete distribution, that is, Pr(S id = s | E id = 1, B i = 1, V id = v) = 1/(L i -v + 1), s = 1, ..., L i -v + 1. The likelihood of full data is a product over I regions: Pr (Yi, Ei, Bi) We estimate f 0d and f 1d for each individual dataset separately using the MOSAiCS algorithm. Therefore, the quantities p 0id and p 1id , i = 1, ..., I, d = 1, ..., D are fixed givenf 0d andf 1d . Because B, E, S, and V are unobserved variables, we derive an expectation-maximization [32] algorithm to obtain maximum likelihood estimators of τ 1 and η = (η 1 , ..., η d ) based on the likelihood in (1). The full data log likelihood can be written as: where C is a constant that does not contain the parameters to be estimated and can be computed givenf 0d andf 1d . Taking expectation of the full data likelihood conditional on observed read counts Y, we obtain the following E and M steps, where τ t 1 , η t denote parameter estimates from the tth iteration: E step: This EM algorithm converged within 100 iterations in both the computational experiments and the analysis of ChIP-seq data of histone modifications used in the case study. We used the posterior probabilities Pr B i |Y i ,τ 1 ,η and Pr E id |Y i ,τ 1 ,η for false discovery rate control with a direct posterior probability approach [33] in the computational experiments.

Computational experiments
All the computational experiments were based on the following procedure. The reference genome (human for STAT1 and H3K9me3 or mouse for MeCP2) was divided into bins (50 bp for STAT1, 250 bp for H3K9me3, and 200 bp for MeCP2) based on average fragment size in the actual experiment. Consecutive n ∈ {3, 5} bins were organized into non-overlapping regions to facilitate B-level data generation. For each region i, i = 1..., I, the B i variable was set to 1 with probability τ 1 . If B i = 0, then all the E id and Z idj variables were set to 0 for that region, indicating no enrichment for all the bins in the region across all the datasets. For regions with B i = 1, the E variable was simulated at the dataset level, for example, E id was set to 1 with probability η d . The bin-level Z variables were generated based on E id . For E id = 1, the region i should have at least one enriched bin in dataset d. To ensure this, we selected the bin that the enrichment starts within in a region at random and allowed the number of consecutive bins with enrichment to vary within each region. For non-enriched bins, Z idj was set to 0 and the corresponding Y layer data (read counts) were generated from the background distribution. For enriched bins, Z idj was set to 1 or 2 with probabilities p 1 and 1 -p 1 , and denoted the components of the mixture distribution for the signal. Specifically, Z idj = 1 implied that Y idjÑ idj + NegBin(b 1 , c 1 /(1 + c 1 )), whereas Z idj = 2 referred to Y idj~Nidj + NegBin(b 2 , c 2 /(1 + c 2 )). We generated multiple ChIP-seq datasets by varying the signal component parameters b 1 , b 2 , c 1 , c 2 , and p 1 of this procedure according to the parameters estimated from the actual ChIP-seq studies (Additional file 3, Table S1).

Separate analysis of multiple ChIP-seq datasets and annotation of genomes into combinatorial patterns in the computational experiments
In the separate analysis, we analyzed each dataset by MOSAiCS [19]. This allowed us to quantify the gain due to the joint modeling approach rather than differences in modeling the read count data by different ChIP-seq analysis methods. MOSAiCS reports bin-level posterior probabilities of enrichment (posterior probabilities at the Z layer). For the sensitivity and empirical FDR calculations, enriched bins were identified at the various levels of nominal FDR using a direct posterior probability approach [33]. Then, dataset-specific E variables were set to 1 if there was at least one enriched bin in a region. Similarly, region-specific B variables were set to 1 if at least one of the E variables for a given region was set to 1. The accuracy calculations required ranking of regions based on the B and E variables. For this purpose, we followed a meta-analytic approach and used the maximum of bin-level posterior probabilities of enrichment within each region for inference at the E level and the maximum within each region across D datasets for inference at the B level. Then, these posterior probabilities were used for ranking the regions in the accuracy plots. We also considered FDR control over these meta-analytically defined B and E variables as an alternative to the above approach for identifying the set of enriched regions in the separate analysis; however, this modification yielded similar results and did not change the overall conclusions. Ranking for the joint analysis in the accuracy plots utilized posterior inferences for the B and E variables based on the jMOSAiCS model. Accuracy as a function of the top number of detected enriched regions required ranking of regions by chromHMM. For each region, we summed over chromHMM estimated pattern probability times the pattern-specific emission probability of each bin within the region and generated pattern-specific posterior probabilities for ranking.
Comparison of chromHMM and jMOSAiCS required annotation of the genome into TF binding/chromatin states based on the jMOSAiCS fit. We calculated the joint posterior probability of the E variables Pr(E i1 = r 1 , ..., E iD = r D | Y i , τ 1 , η) for each combination of r 1 , ..., r D , where r i = 0, 1. The enrichment pattern (or state) of each region is assigned as the one with the maximum joint posterior probability.
jMOSAiCS analysis of multiple histone modification ChIPseq datasets from [6] We partitioned the mouse genome into 200 bp intervals and applied jMOSAiCS to data from the G1E and G1E-ER4+E2 cells separately. Enriched regions were identified by controlling the FDR at 0.01 through the E variable. In the downstream analysis, we focused on 11,485 GATA1-occupied segments defined by [6] and enumerated H3K4me3, H3K4me1, H3K27me3, and H3K9me3 modification patterns of these regions across the two cell types. The median width of the GATA1-occupied segments was 800 bp and only 0.75% of the segments were wider than 2,000 bp.

Quantitative ChIP assay
Quantitative ChIP analysis was conducted with two independent biological replicates of beta-estradiolinduced G1E-ER-GATA-1 cells using control and specific antibodies as described in [34]. The relative levels of the specific histone marks are indicated in the Additional file 3, Table S2. The PCR primers used to analyze the four loci are provided in Additional file 3, Table S3.

Additional material
Additional file 1: R package for jMOSAiCS. Abbreviations bp: base pair; ChIP: chromatin immunoprecipitation; ENCODE: Encyclopedia of DNA Elements; ER: estrogen receptor; FDR: false discovery rate; GC: guanine-cytosine; jMOSAiCS: joint model-based one-and two-sample analysis and inference for ChIP-seq; MeCP2: methyl CpG binding protein 2; MOSAiCS: model-based analysis and inference for ChIP-seq data; NGS: next generation sequencing; PBMC: peripheral blood mononuclear cell; PCR: polymerase chain reaction; Pol II: polymerase II; TF: transcription factor.
Authors' contributions SK conceived and designed the method, computational experiments, and the data analysis and wrote the paper. XZ designed and implemented the method, computational experiments, and the data analysis, and wrote the paper. RS and EHB performed experimental validation of the selected targets. HL and QC contributed data. All authors read and approved the final manuscript.