Chromatin segmentation based on a probabilistic model for read counts explains a large portion of the epigenome

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is an increasingly common experimental approach to generate genome-wide maps of histone modifications and to dissect the complexity of the epigenome. Here, we propose EpiCSeg: a novel algorithm that combines several histone modification maps for the segmentation and characterization of cell-type specific epigenomic landscapes. By using an accurate probabilistic model for the read counts, EpiCSeg provides a useful annotation for a considerably larger portion of the genome, shows a stronger association with validation data, and yields more consistent predictions across replicate experiments when compared to existing methods. The software is available at http://github.com/lamortenera/epicseg Electronic supplementary material The online version of this article (doi:10.1186/s13059-015-0708-z) contains supplementary material, which is available to authorized users.

• Initializing the parameters for k chromatin states can be done by clustering the observation vectors into k clusters and therefore disregarding the order of the observations.
• Many small clusters (seeds) can be merged and reduced to k clusters using hierarchical clustering.
• Principal Component Analysis (PCA) can be applied on the count matrix to find good seeds.
Let k be the desired number of clusters and C the matrix with n columns and L = g i=1 L i rows obtained by combining all matrices C (i) . C and k are the input of the algorithm while the output are k sets of bins. These sets are not exactly a clustering, because they can overlap, and not even proper sets, as the same bin in the same set can appear more than once. However it is easy to fit the probabilistic model starting from them (not discussed).
The algorithm performs the following steps: 1. PCA is performed on the count matrix C. This results in a coordinate matrix P with the same dimensions as C but where the columns correspond to principal components (PCs).
2. For each PC, nlev seeds are computed. Here seed means a subset of the bins and nlev is a parameter of the initialization algorithm. The seeds deriving from a PC p result from partitioning the rows of matrix P into nlev groups of equal size according to the intensity of the p-th column. So, for instance, the first group is formed by the L/nlev rows with the lowest values in column p. This yields a total of nlev · k seeds.
3. For each seed the parameters of a negative multinomial distribution N M (µ, r, p 1 , p 2 , ...p n ) are determined. The r parameter is fitted for all seeds according to some properties of matrix C (not discussed), while the other parameters are estimated by maximum likelihood on the data points identified by the seed. 4. A distance matrix between seeds is computed. The distance between two seeds is defined as the symmetrized Kullback-Leibler divergence between the two corresponding negative multinomial distributions.
5. The distance matrix is used as input to the hierarchical clustering algorithm with average linkage. The hclust function available in R is used. Hierarchical clustering produces a tree where the leaves represent the different seeds, internal nodes represent clusters of seeds and where the length of a branch represents the similarity between two nodes. 2 6. The tree produced by the hierarchical clustering algorithm is cut at a distance from the root such that the resulting tree has exactly k nodes. Each node is a group of seeds that will be used to initialize a chromatin state.

The probabilistic model of EpiCSeg
Hidden Markov Models (HMMs) are very well known in literature and many of the equations presented here can be found in textbooks covering the topic, such as [1]. A brief introduction, however, is necessary to present the update equations for the negative multinomial distributions, which represent a novel contribution of this work.

Model definition
For ease of discussion we will assume that there is only one count matrix C corresponding to one genomic region with a total of L bins. The t-th row of the count matrix, where 1 ≤ t ≤ L, will be denoted as c t , the count in the t-th bin corresponding to the m-th mark, where 1 ≤ m ≤ n, will be denoted as c tm and the sum of the counts in the t-th row will be denoted as c t+ , i.e. c t+ = n m=1 c tm . The main assumption of the model is that there are k hidden states and that the sequence of L observations c t is explained by a hidden sequence of L states. The random variable X t denotes the hidden state at position t and the random variable Y t represents the observation at time t (in this context Y t is always known and equals c t , but the random variable notation is useful to explain the probabilistic model).
The complete set of model parameters, denoted as θ, consists of the parameters π, A, M 1 , M 2 , ...M k , where: • π, the initial probabilities, are a vector of k probabilities summing up to one where π s specifies P rob{X 1 = s|θ}.
• A, the transition probabilities, are a square matrix of size k, where the element in row u and column v, denoted as a uv , specifies P rob{X t+1 = v|X t = u, θ}, independently of the position t in the sequence. In A each row u sums up to one, i.e. k v=1 a uv = 1.
• M s is the parameter set that determines the emission probabilities relative to state s, which follows a Negative Multinomial distribution with parameters µ s , r s , p s1 , p s2 , ...p sn . The Negative Multinomial distribution can be seen as a combination between the Negative Binomial distribution, specified by parameters µ s and r s , and the Multinomial distribution, specified by the parameters p s1 , p s2 , ...p sn . In formulas In an important variant of the EpiCSeg model all r s variables are constrained to have the same value. This strategy has proved effective in avoiding overfitting and excluding some unrealistic models where different states s have wildly different values of the parameter r s . This variant of the model is called the dependent mode, while the variant where the r s parameters are independent is called the independent mode. Unless otherwise specified this discussion focuses on the independent mode. Given this model, the probability that the sequence of observations Y = Y 1 , Y 2 , ...Y L equals the count matrix (i.e., that Y t = c t for 1 ≤ t ≤ L), and that the sequence of hidden states X = X 1 , X 2 , ...X L equals a certain sequence s = s 1 , s 2 , ..s L , with s ∈ {1, 2, ...k} L , is given by and the overall probability of the observed data can be obtained by summing the contributions of all possible paths:

Parameter estimation through Expectation Maximization
To estimate the parameters of the model we follow a Maximum Likelihood approach, i.e. we seek to estimate the set of parameters θ (opt) that maximizes the overall probability of the observed data Unfortunately this is a very difficult problem which typically needs to be dealt with with some iterative procedure, such as Expectation Maximization (EM). With Expectation Maximization, starting from a parameter set θ (1) , a sequence of parameter sets θ (2) , θ (3) , θ (4) ... is computed using the following update rule: It can be shown that using this strategy P rob{Y = C|θ i } is always less or equal to P rob{Y = C|θ (i+1) } (see, for instance, [3]). The sequence typically stops when there is no further increase in the objective function and the algorithm is said to converge. The major drawbacks of this approach are that the final parameters of the sequence are not guaranteed to be a global optimum and different initial parameters can lead to different final parameters. The Expectation Maximization strategy applied to Hidden Markov Models is known as the Baum-Welch algorithm.

The Baum-Welch algorithm
The Baum-Welch algorithm uses the posterior probabilities γ s (t) and the quantities ξ uv (t). They depend on the last set of parameters θ (i) and they are defined in the following way: Those quantities, which are computed using the Forward-Backward strategy (see [1]), allow to separate the maximization problem in Equation 2 into three independent and simpler maximization problems: Solving the maximization problem in Equation 3 separately for each parameter yields the following well known update rules for the initial and transition probabilities:

Update rules for the Negative Multinomial distribution
The factorization shown in Equation 1 allows to further split the maximization problem into This yields: , and for the r s parameters: Unfortunately there is no closed formula for the last maximization problem, which needs to be solved numerically as a one dimensional optimization problem. It is known, however, that there exist only one local maximum, which is also a global maximum [2]. In EpiCSeg a variant of the Brent algorithm is used for this task.
In the dependent mode of the EpiCSeg model, i.e. when all parameters r s are constrained to have the same value r, all the above update equations remain valid except Equation 4, which becomes and can be solved numerically as in the previous case.

Computational considerations
The update formulas 4 and 5 can be very costly to compute. The optimization function consists of a summation over all the L bins of a quantity that depends on the f N B function, which is a very costly function. Numerical methods need to evaluate the optimization function, or its derivative, a number N r of times before returning the updated value for r s , where typical values for N r range from 10 to 30 iterations, and where these evaluations need to be done serially, i.e. they cannot be executed in parallel. Assuming that L ∼ 1.5 · 10 7 (as for a whole human genome and with a bin size of 200 bps) and N t ∼ 20, the Expectation Maximization algorithm would need to evaluate the f N B function about 3 · 10 8 times at each iteration. This problem can be alleviated by grouping together evaluations of the f N B (c t+ ; µ (i+1) s , r s ) function with the same c t+ value. Let , where the expression delimited by square brackets evaluates to one when the expression inside it is true and to zero otherwise. Then the update formulas 4 and 5 can be rewritten respectively as and respectively. Considering that D, even in genome-wide datasets, rarely contains more than 10 4 elements, the f N B function now needs to be evaluated 1.5 · 10 3 times less frequently.

Supervised annotation
The aim of the supervised annotation is to provide a benchmark dataset to evaluate the efficiency of a segmentation algorithm. This section describes the procedure used to derive such an annotation. The input to the procedure is a RNA-seq and a DNase-I hypersensitivity experiment from the same cell type as the histone mark experiments used for the segmentation. Additionally the procedure uses the annotated transcripts available from the GENCODE database, version 19 (Ensembl 74). More precisely, all annotations of type "transcript" and of level 1 and 2 where considered. The binning scheme used for this procedure is the same as the one used for segmentation, i.e. the bin have size 200 base pairs, with the difference that all bins overlapping a non-assembled region of the genome were discarded.

RNA-seq processing
In the RNA-seq tracks, which are paired-end sequencing experiments, regions ranging from the leftmost to the rightmost coordinate of a mapped read pair were treated as unstranded transcribed regions. The coverage per base pair was computed as the number of such transcribed regions that overlap the base pair, and the transcription signal per bin was defined as the average coverage in the bin. The top 15% bins were considered transcribed.

DNase-I HS processing
In the DNase-I hypersensitivity tracks, which are single-end sequencing experiments, the accessibility signal per bin was defined as the number of reads with a 5' end mapping within the bin. The top 2% bins were considered accessible and the top 0.8% were considered strongly accessible.

Annotation criteria
The following criteria were used to define each chromatin environment: • A RNA bin was defined as a transcribed and non-accessible bin and such that all the 10 bins to the right and to the left are also transcribed and non-accessible.
• A DNase bin was defined as a strongly accessible bin. If the bin is closer than 500 base pairs to an annotated TSS it was considered a DNase+TSS bin, otherwise a DNase-TSS bin.
• A bin was annotated as intergenic if that bin, as well as the 100 bins to the left and to the right, are neither a DNase nor a RNA bin. However, we needed to correct for the fact that the distribution of the chromatin environments is highly imbalanced. Some environments, such as the intergenic environment, occur much more often and influence the performance score much more that others, such as the DNase+TSS environment. This leads to large oscillations of the score with small perturbations of the input data and to the paradox that the most biologically interesting states, such as those related to the DNase+TSS environment, play a lesser role in the performance score. To correct for that we use proportions P normalized for the chromatin environment frequency:P ES (e, s) = P ES (e, s)/{P E (s)|E|}, and P E ,P S are defined starting fromP ES as above. The (corrected) mutual information (see Figure 2) is defined as above, but using the corrected frequencies: Note that if (X E , X S ) are two random variables with joint probability distribution given byP ES , the above formula can be related to the conditional entropy of X E given X S : Lastly, we measured how similar each pair of chromatin environments is in terms of chromatin states frequency (see Figure 3) using the following measure, that we called mutual similarity: The runtime of the two algorithms was measured by running a whole-genome segmentation with 10 states on each dataset. Each genome-wide dataset consists of a matrix with 15181508 bins (rows) and with 27, 26, 12 and 12 marks (columns) for the IMR90, H1, K562_1 and K562_2 datasets respectively. We used the default parameters for each algorithm except the number of cores, which was set to 10. The computer used for the comparison is a 64x AMD Opteron 6282 SE with 517 GB or RAM (even though each algorithm needed at most 2 GB). The result of the measurements can be seen in Figure 4, and it suggests that both ChromHMM and EpiCSeg are fast enough for our needs.

Scores dependence on the number of states
We measured the scores reported in the main document varying the number of states from 2 to 40. See the main document for a description of each score. For practical reasons we limited the input data to the first chromosome only (chr1). The low R 2 with the IMR90 dataset is most likely due to a much smaller sequencing coverage of the RNA-seq experiment (roughly 10 times less reads than in the other datasets).

Comparison with Segway
The comparison with Segway poses a challenge both because of its larger runtime and because it works at a single base pair resolution. While also EpiCSeg and ChromHMM can run at a single base pair resolution, aggregating counts from adjacent base pairs has important smoothing properties, both for the segmentation algorithms and for our validation procedure. These smoothing properties play an indispensable role in our view of chromatin states. To mention one of these properties, binning reduces the oscillations in read counts due to nucleosome positioning. Even though in other contexts it might be desirable to discern between linker and nucleosomal DNA, we consider a promoter region, typically characterized by an array of positioned nucleosomes, as a single entity, which should be annotated by a single segment.
In the computation of our performance scores, for Segway, we converted the state assignments to each base into state assignments to each bin. This was done by picking the most abundant state in each bin (draws, which occurred for less than 1% of the bins, were resolved by picking randomly one of the most abundant states). EpiCSeg and ChromHMM, by contrast, were run using the binning scheme. All other settings are as reported in the main document.
We dealt with the runtime problem by running all segmentation algorithms only on chromosome 21. ChromHMM and EpiCSeg perform training and prediction as a single task, while with Segway we first used the 'TRAIN' task, and then the 'IDENTIFY' task. However, Segway runtime was still too large to allow for training on the whole chromosome. We restricted the training task on the region chr21:37313025-38619844 (zero-based, left-inclusive, right-exclusive coordinates), which is about 1.3 megabases large (2.7% of the whole chromosome) and gene-rich (but also with some geneempty subregions). Even using these settings Segway was about 400 times slower than EpiCSeg and ChromHMM (see Figure 6).
In Figure 7 we report the results of the comparison using the performance scores described in the main document. The results show that Segway's performance is, in general, considerably lower compared to ChromHMM and EpiCSeg. As an exception, Segway achieves a high precision in TSS prediction task, however this is also accompanied by a low sensitivity. From Figure 8 it can be seen how Segway tends to assign most of the base pairs to a few states. It can also be seen that the promoter-associated state in Segway seems to be affected by the nucleosome pattern expected at promoters, which consists of a nucleosome-depleted region right before the TSS and positioned nuclesomes after and before. According to a view of chromatin states others than the one implicitly adopted in this paper, this might be a desirable behaviour. Segway, the states have been identified at a single base pair resolution, but because the resulting state distribution around transcripts was very noisy, the curves shown above have been smoothed using a running average with a window size of 17 base pairs.

Replicate similarity
We computed how similar the datasets K562_1 and K562_2 are by measuring the Pearson correlation coefficient and the total number of reads ( Figure 9) per replicate pair. Note that these statistics rely on the same binning scheme used for the segmentation, i.e. each sample is considered as a vector containing the read counts per genomic bin. The first plot shows that the experiments are indeed strongly correlated, suggesting that they are a reflection of the same biological processes. However, there are considerable differences in sequencing coverage.  On the right, the total read count for each replicate pair. In both plots each replicate is transformed to a vector containing the read counts per bin.

Error rates vs read coverage
We studied how the disagreement between segmentations from replicate datasets depends on the total read count per bin. As a result of the robustness analysis presented in the main document, for each algorithm and for each genomic bin in a certain chromosome we have a match or a mismatch depending on whether for that chromosome the segmentations from the K562_1 and K562_1 datasets differ in that bin. Next, bins are grouped according to the total read count across marks (computed from the K562_1 dataset), so that each group consists of at least 500 bins. The mismatch ratio, or error rate, is the ratio of bins in a group where a mismatch occurs. Figure 10 shows how the error rates depend on the read counts. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q The plots at the top show how the error rates depend on the read counts. Mismatches between bins result from the robustness analysis described in the main document. Mismatch ratios, or error rates, are the ratios of bins in a group where a mismatch occurs. Groups are defined by the total read count across marks in each bin, so that each group consists of at least 500 bins. At the bottom, two identical histograms showing how many bins belong to each group.

EpiCSeg's model robustness to shifts of the binning offset
The average Jaccard index is an algorithm-independent quantitative measurement of the similarity between two segmentations which is based on finding a correspondence between states. Here,

Genomic distribution of the chromatin states
In the Figures 12, 13, 14 and 15 we report the summary statistics described in the main document for all datasets.