Spectacle: fast chromatin state annotation using spectral learning

Epigenomic data from ENCODE can be used to associate specific combinations of chromatin marks with regulatory elements in the human genome. Hidden Markov models and the expectation-maximization (EM) algorithm are often used to analyze epigenomic data. However, the EM algorithm can overfit on data sets in which the chromatin states show high class imbalance, and it is often slow to converge. Here we use spectral learning instead of EM and find that our software, Spectacle, overcomes these problems. Furthermore, Spectacle is able to find enhancer subtypes not found by ChromHMM but strongly enriched in GWAS SNPs. Spectacle is available at https://github.com/jiminsong/Spectacle.


Overview of our spectral learning approach
Spectral learning is a novel approach to parameter estimation for latent variable models (e.g. hidden Markov models, topic models) that relies on a Method of Moments approach instead of Maximum Likelihood. In the Method of Moments approach, we express various expected moments of the model (e.g. mean, covariance) as functions of the parameters, set the expected moments equal to the sample moments estimated from the data, and solve the resulting equations for the parameters. Spectral learning specializes this approach to latent variable models by making an additional mild assumption that various parameter matrices are full rank. For example, in the case of HMMs, spectral learning is provably correct (i.e. it recovers the true parameters of the model) assuming that the transition and emission matrices of the HMM have rank equal to the number of chromatin states and that the initial state vector is strictly greater than 0 in all coordinates. One can also prove polynomial sample complexity bounds for the method. These assumptions are considered mild because learning is generally not considered meaningful when they fail: for example, a model in which the emission probability vector of one chromatin state is a linear combination of two others, or even identical to that of another state, would generally be considered degenerate. In practical applications, spectral learning can be much faster than the EM algorithm since the main computation is simply the singular value decomposition (SVD) of the second moment matrix. Unlike EM, spectral learning does not suffer from local optima and does not depend on the parameter initialization.
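To make this computation concrete, the following minimal Python sketch (illustrative only, not the Spectacle implementation; the function name second_moment_svd and the toy data are our own) estimates the empirical second moment matrix of consecutive observations from a single observation sequence and computes its SVD, which is the central computation described above.

```python
import numpy as np

def second_moment_svd(obs, n_symbols, k):
    """Estimate the second moment (co-occurrence) matrix P21 of consecutive
    observations, P21[i, j] ~ Pr[x_{t+1} = i, x_t = j], and return its top-k SVD.
    `obs` is an integer-encoded observation sequence (one symbol per genomic bin)."""
    P21 = np.zeros((n_symbols, n_symbols))
    for t in range(len(obs) - 1):
        P21[obs[t + 1], obs[t]] += 1.0
    P21 /= P21.sum()                      # empirical joint probabilities
    U, S, Vt = np.linalg.svd(P21)         # full SVD; only the top k components are used
    return U[:, :k], S[:k], Vt[:k], P21

# Toy usage: k = 3 would correspond to 3 chromatin states over 8 observed mark combinations.
rng = np.random.default_rng(0)
obs = rng.integers(0, 8, size=10_000)
U, S, Vt, P21 = second_moment_svd(obs, n_symbols=8, k=3)
```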
There are many open theoretical issues in spectral learning that the machine learning community is currently working on. For example, the question of how best to incorporate regularization into the spectral learning framework is a major open problem. Beyond these theoretical problems, not many practical attempts have been made to implement and test this class of methods in any applied area. There are a few implementations of spectral learning in the natural language processing and computer vision communities, but these data types have different characteristics from genomics data. For example, in natural language processing one often has access to many short samples, whereas in genomics we have one very long sample (i.e. the genome). In the context of computational biology, to the best of our knowledge there are only two previous implementations, which were cited in the Discussion section of the main manuscript. However, both were for very specific applications: one as a string kernel method within a discriminative classifier, and one for contrasting a foreground and a background data set in cases where this distinction makes biological sense.
The overall method of Spectacle is based on the method of [1], which learns the HMM parameters only up to an unknown invertible transformation. The transformed parameters are still useful for computing the likelihood of the data and predicting future observations (this representation is also called an "Observable Operator Model") but cannot be used to annotate the hidden chromatin states. Mossel and Roch showed how to infer the HMM parameters explicitly for the special case where the state and observation spaces have the same dimension [2], and [1] extended their method to more general HMMs. However, the published method can be unstable if the eigenvalues corresponding to the chromatin states are not well separated, and it relies on injecting noise into the eigenvector computation to spread the eigenvalues [1].
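For reference, writing $O$ for the emission matrix ($O_{ij} = \Pr[x_t = i \mid h_t = j]$), $T$ for the transition matrix ($T_{ij} = \Pr[h_{t+1} = i \mid h_t = j]$) and $\pi$ for the initial state distribution, the low-order moments used in the framework of [1] satisfy

$$P_1 = O\pi, \qquad P_{2,1} = O\,T\,\mathrm{diag}(\pi)\,O^{\top}, \qquad P_{3,x,1} = O\,T\,\mathrm{diag}(O_{x,\cdot})\,T\,\mathrm{diag}(\pi)\,O^{\top},$$

where $[P_1]_i = \Pr[x_1 = i]$, $[P_{2,1}]_{ij} = \Pr[x_2 = i, x_1 = j]$ and $[P_{3,x,1}]_{ij} = \Pr[x_3 = i, x_2 = x, x_1 = j]$. The spectral algorithm estimates these moments from the data and recovers the (transformed) parameters from an SVD of $P_{2,1}$.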
In this work we have improved the method in a deterministic and principled way to infer the (untransformed) HMM parameters, by using a single major observation for each chromatin state to compute the eigenvectors. In addition, [1] assumed access to many short, independent samples from the HMM; we have adapted their method to our setting, in which we have a few long samples from the HMM, by rewriting the formula for the transition matrix parameters in a form that does not depend on the initial state probability distribution. Finally, we have contributed empirically developed solutions for dealing with noise in our application of chromatin state annotation. In our experiments, we made the novel observation that spectral learning tends to be more robust on class-imbalanced data than the maximum likelihood approach. We believe this property is of significant interest for follow-up theoretical studies.

Comparison of the time complexity of spectral learning and the EM algorithm
The training time of one iteration of EM is linear in the number of genomic segments and the number of chromatin mark combinations, and quadratic in the number of chromatin states. On the other hand, the training time of spectral learning is dominated by the computation time of the SVD, which is cubic in the number N of observed chromatin mark combinations if all singular values are computed exactly (this is the running time of a practical implementation; theoretically the cubic dependence can be improved to the exponent of matrix multiplication [3], which is currently < 2.373 [4]). In fact, the practical running time is significantly faster because we only need to compute the top K singular vectors corresponding to the K chromatin states, and typically K << N. Conversely, the training time of spectral learning depends exponentially on the number of histone marks if there is no redundancy between histone marks and all possible combinations of histone marks exist in the sample data. The total number of known histone marks is currently over 130 [5], but in practice the number of histone marks assayed in a single cell type is currently much smaller due to the expense and difficulty of the experiments. For example, there are fewer than 12 and 30 epigenetic marks in the ENCODE and Roadmap Epigenomics data, respectively, and fewer than eight histone marks are available for the majority of cell types. Furthermore, in practice, the number of observed combinations of chromatin marks is far smaller than the theoretical maximum, growing only as a low-order polynomial in the number of chromatin marks rather than exponentially. Thus standard sparse linear algebra approaches are very efficient when applying spectral learning in the context of chromatin state annotation (see Main Text). Also, from a biological standpoint it is important in our view to study all possible combinations of chromatin marks to discover any possible interactions among the marks, rather than assuming that the marks are independent, as in other methods.
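As an illustration of how only the top K singular vectors need to be computed in practice, the following sketch (assuming SciPy's sparse linear algebra routines; the matrix here is a random toy stand-in for the co-occurrence matrix of observed mark combinations, not the actual Spectacle code) uses a truncated sparse SVD:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# Sparse co-occurrence matrix of observed mark combinations (toy dimensions):
# most of the 2^M theoretical combinations never occur, so the matrix is sparse.
rng = np.random.default_rng(1)
n_observed_combos = 5_000
rows = rng.integers(0, n_observed_combos, size=50_000)
cols = rng.integers(0, n_observed_combos, size=50_000)
vals = rng.random(50_000)
P21 = sparse.coo_matrix((vals, (rows, cols)),
                        shape=(n_observed_combos, n_observed_combos)).tocsr()

K = 20                      # number of chromatin states
U, S, Vt = svds(P21, k=K)   # truncated SVD: only the top K singular triplets
```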

Analysis of the number of chromatin mark combinations
To analyze the actual number of combinations of chromatin marks that appear in real human epigenomics data, we used the following 16 chromatin marks in GM12878: CTCF, H2AZ, H3K4me1/2/3, H3K9ac, H3K9me3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, H4K20me1, P300, Pol2, DNase I hypersensitive sites and FAIRE. We found that the number of combinations of chromatin marks was a low-order polynomial (degree 1 to 2) in the number of marks (Figure S1). The numbers of pairs and triples of combinations of chromatin marks were also low-order polynomials in the number of chromatin marks (Figures S2, S3).
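A simple way to reproduce this kind of analysis (a sketch on synthetic data; the real input would be the binarized GM12878 peak calls described above) is to count the distinct rows of the binarized mark matrix for random subsets of marks:

```python
import numpy as np

def count_combinations(binarized, mark_subset):
    """Count the distinct combinations (rows) of the selected marks that occur in
    the data. `binarized` is a (n_bins x n_marks) 0/1 matrix of peak calls."""
    sub = binarized[:, list(mark_subset)]
    return len({tuple(row) for row in sub})

# Toy data: 16 marks over 100,000 genomic bins.
rng = np.random.default_rng(2)
binarized = (rng.random((100_000, 16)) < 0.05).astype(np.int8)

# Average number of observed combinations over random subsets of x marks,
# analogous to the analysis behind Figure S1.
for x in (4, 8, 12, 16):
    subsets = [rng.choice(16, size=x, replace=False) for _ in range(5)]
    avg = np.mean([count_combinations(binarized, s) for s in subsets])
    print(x, avg)
```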

Runtime of Spectacle with larger numbers of epigenetic marks
In order to run Spectacle with a large number of chromatin marks, we used the version of our method implemented in Python using the SciPy sparse matrix library. We ran Spectacle and ChromHMM using 29 chromatin marks for the embryonic stem cell line H9 (new EID E008) from the Roadmap Epigenomics project. In theory, there can be up to 2^29 combinations of chromatin marks, but in fact we observed only 186,814 (∼0.03%) combinations in the genome. After eliminating singleton combinations, which are likely due to noise, there were only 56,750 (∼0.01%) combinations that appeared at least twice in the genome. Spectacle took ∼43.2 min after removing singleton combinations. On the other hand, it took ∼8.1 hr to run ChromHMM for 200 iterations without reaching convergence. We also note again that Spectacle considers all possible combinations of epigenetic marks while ChromHMM makes a conditional independence assumption between epigenetic marks.
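The singleton filtering step can be sketched as follows (illustrative only; the encoding of each bin's mark combination as an integer bit pattern and the helper name drop_singleton_combinations are our own, not Spectacle's actual code):

```python
from collections import Counter
import numpy as np

def drop_singleton_combinations(obs_codes):
    """Map each genomic bin's mark-combination code to a compact symbol index,
    discarding combinations seen only once (likely noise, as described above)."""
    counts = Counter(obs_codes)
    kept = sorted(c for c, n in counts.items() if n >= 2)
    code_to_symbol = {c: i for i, c in enumerate(kept)}
    # Bins whose combination was a singleton are mapped to -1 (handled downstream).
    return np.array([code_to_symbol.get(c, -1) for c in obs_codes]), code_to_symbol

# Toy usage: each bin's combination of marks encoded as an integer bit pattern.
rng = np.random.default_rng(3)
obs_codes = rng.integers(0, 2**16, size=200_000)
symbols, mapping = drop_singleton_combinations(obs_codes)
```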

Performance of spectral learning on a data set with more balanced classes
We tested Spectacle on a different binarization of the ENCODE ChIP-seq reads using the Scripture program [6]. Scripture produces broad peak calls for ChIP-seq reads, which is most appropriate for some histone marks such as H3K36me3 and H3K9me3 [7]. For most other known histone marks, sharp peak calls are more appropriate. Consistent with this, the Scripture peak calls covered 81-97% of the peaks called by the Poisson binarization method, and we expect that the additional Scripture peak calls likely include some false positive peaks. Thus, the Scripture data is noisier and more balanced between different classes (∼52% of the genome was in the null state (Table 1), instead of ∼90%). Consistent with our observations about class imbalance, our experiments on the Scripture data showed that maximum likelihood is a good criterion in the class-balanced case, in the sense that improvements in chromatin state annotation (as judged by overlap with biological features) corresponded to higher likelihoods. On this data set, we thus tested the possibility of using Spectacle as an initializer for the EM algorithm, which is another common use of Method of Moments estimators (e.g. [8,9]).
We first benchmarked the performance of the EM algorithm with either random parameter initialization or a published heuristic initializer. For 15 chromatin states, we ran the EM algorithm to convergence for 10 random parameter initializations for the GM12878 cell line. The log-likelihood ranged from −1.11E7 to −8.33E6 (average: −9.51E6, std. dev.: 7.67E5), and the number of iterations needed for the likelihood to converge varied from 16 to 200. We found that the highest log-likelihood from the random initializations was lower than that of the heuristic initialization method ("information"), which was −7.81E6. We thus used the information heuristic as a baseline for further comparisons.
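The structure of this kind of benchmark can be sketched with an off-the-shelf HMM library (here hmmlearn's CategoricalHMM, which we assume is available; the actual experiments used ChromHMM, and the toy data and initial parameters below are placeholders):

```python
import numpy as np
from hmmlearn import hmm   # assumed available; hmmlearn >= 0.2.8 provides CategoricalHMM

rng = np.random.default_rng(4)
X = rng.integers(0, 8, size=(50_000, 1))   # toy symbol sequence (one symbol per bin)
K = 15                                     # number of chromatin states

# EM from several random restarts (random_state changes the random emission
# initialization), keeping the model with the best log-likelihood.
best = None
for seed in range(3):
    m = hmm.CategoricalHMM(n_components=K, n_iter=50, tol=1e-4, random_state=seed)
    m.fit(X)
    ll = m.score(X)
    if best is None or ll > best[0]:
        best = (ll, m)

# EM started from externally supplied parameters (e.g. a Method of Moments
# estimate), analogous to using Spectacle as an initializer; pi0, T0 and E0 are
# uniform placeholders here, not an actual spectral estimate.
pi0 = np.full(K, 1.0 / K)
T0 = np.full((K, K), 1.0 / K)
E0 = np.full((K, 8), 1.0 / 8)
m_init = hmm.CategoricalHMM(n_components=K, n_iter=5, init_params='')
m_init.startprob_, m_init.transmat_, m_init.emissionprob_ = pi0, T0, E0
m_init.fit(X)
```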
We compared Spectacle followed by a few (up to five) iterations of EM against ChromHMM with the information initializer, in terms of the likelihood of the data, for the ENCODE Tier 1 cell line GM12878. For almost all numbers of states we tested, Spectacle had a higher likelihood than ChromHMM with the information initializer (Table S1). For the case of 100 states, the difference in the performance of the two methods was rather large and the likelihood found by ChromHMM was lower than that for fewer chromatin states, suggesting that the EM algorithm found a poor local optimum for this particular number of states. Since ChromHMM took 20-200 EM iterations to converge to a local optimum in the likelihood, we next ran EM for only five iterations to make its runtime comparable to the spectral learning approach. In this case we found that spectral learning had a higher likelihood than ChromHMM with the information initializer for all numbers of states tested. Taken together, these results show that when the run times of the two methods were constrained to be similar, spectral learning always outperformed EM, and even when spectral learning took a much shorter runtime, it still generally performed better than EM as judged by the likelihood.
We then compared both methods over multiple cell types to assess the performance of Spectacle over a range of data sets. We examined ten cell types including GM12878 and ran spectral learning and ChromHMM with 15 hidden chromatin states. Spectral learning found higher likelihoods for 8 out of 10 cell types while also training 2.6-12.6 fold faster in CPU time and 2.6-13.4 fold faster in number of iterations (Table S2). For 20 chromatin states, spectral learning found higher likelihoods for 9 out of 10 cell types while also training 2.6-12.6 fold faster in CPU time and 2.6-12.6 fold faster in number of iterations (Table S3). Finally we verified that the higher likelihoods indeed corresponded to improved precision-recall accuracy on a number of biological features (Figures S4-S13).
Taken together, from our testing on the Scripture data set, we found that Maximum Likelihood was a good criterion for class-balanced epigenomic data. In this case Spectacle was useful as a parameter initialization method for the EM algorithm or other local search algorithms, and it improved on the run time, likelihood and precision-recall accuracy compared to the published "information" heuristic initializer. Although this approach does not appear to be required on current human or mammalian epigenomic data sets, we note that non-vertebrate species with more compact genomes are likely to have more class-balanced epigenomic data sets because they have a higher density of genes and regulatory elements compared with non-regulatory non-coding regions, so this approach may be useful for those data sets.
To compare the results of Spectacle with Segway, we downloaded the chromatin mark datasets used in Hoffman et al. [10] and the genome segmentations produced by Segway when run on those datasets. We ran Spectacle with 25 chromatin states on the same datasets (Figures S20, S21). The fraction of the genome that did not have any signal was ∼81%. Thus, it was still a class-imbalanced data set, although the number of epigenetic marks (14 marks) was higher than in the data set used in the main text (eight marks). Segway assigned 12 of the 25 chromatin states (i.e. about half of the chromatin states) as inactive states (Quies, Low, Repr), three states as promoter states and four states as enhancer states. On the other hand, Spectacle assigned only four states as inactive states, four states as promoter states, and nine states as strong and weak enhancer states. Consistent with our results with 20 chromatin states in the main text, GWAS SNPs from 62 out of 78 phenotypes, and in particular 12 out of 14 autoimmune diseases, were most enriched in one of the strong enhancer states for Spectacle (Table S6). Thus, taken together, these results were consistent with our observations about the performance of spectral learning vs. the EM algorithm on class-imbalanced data sets. Since Segway uses a graphical model quite similar to a standard Hidden Markov Model and uses a Maximum Likelihood framework for inference, we expected its performance to be more similar to that of ChromHMM than to that of Spectacle.
We also compared our results with TreeHMM. In Figure 4 of the TreeHMM paper, 4 of the 18 chromatin states were assigned as "low signal" states and an additional state was assigned as a repetitive state. Thus we believe that the same issue of overfitting to the background null state also affects TreeHMM, consistent with the fact that TreeHMM also uses maximum likelihood and an approximation to the EM algorithm for estimating the model parameters. Due to the higher number of parameters devoted to the null states, TreeHMM produced fewer promoter and enhancer states than Spectacle (Figures 2, S14, S16). It is not possible to directly compare with the Spectacle solution with 20 chromatin states because TreeHMM learned slightly fewer chromatin states (18) and used the CTCF signal, which was not included in our dataset. However, Spectacle found several strong enhancer sub-types that did not exist in the TreeHMM solution (e.g. states 7, 17 and 20 in GM12878 and states 15 and 20 in K562). Note that state 20 in GM12878 and state 20 in K562 were the enhancer sub-types with the highest GWAS SNP enrichment in our analyses in the main text. TreeHMM also produces an additional type of output, namely comparative chromatin state annotations between multiple cell types, as described in the main text. This output is not directly comparable to the output of our combined cell type analysis because TreeHMM learns a single set of chromatin states for all the cell types, whereas we learn combined chromatin states that can differ for each cell type. Also, TreeHMM produces different genomic segmentations for each cell type, and comparison of specific chromatin states, such as enhancers, must be done as a post-processing step. In contrast, our approach produces a single consistent genomic segmentation of all cell types. Nonetheless, both methods are successful at revealing cell-type specific enhancers.
Spectacle considers all possible combinations of chromatin marks instead of assuming that they are conditionally independent of each other given the hidden state. In theory, this can increase the number of parameters exponentially. Fortunately, the actual number of combinations of chromatin marks that appear in real human epigenomics data is much smaller than the possible number of combinations (see Supplementary Note 2). Still, the number of parameters is much higher than in the model with the conditional independence assumption. Overfitting can be an issue if the model complexity is high and the dataset is relatively small. We addressed this issue with a cross-validation analysis. More specifically, for this analysis we used the combined dataset for three cell types with 23 chromatin marks, the same dataset used in our manuscript (Results section 2.6). We divided the dataset into training and test data sets 23 different times, each time holding out one chromosome as test data. We trained our model and estimated parameters on the training data using Spectacle, predicted chromatin states on the test data from the estimated parameters using the posterior decoding algorithm, and tested the predicted chromatin states using independent validation datasets such as active TSSs and active enhancers. We found that the precisions and recalls for predicting TSSs and enhancers from the cross-validation analysis were almost identical to those from the case where both training and test data were the whole dataset (Table S10). This demonstrates that overfitting is not a major concern for our data.
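The structure of this cross-validation can be sketched as follows (again using hmmlearn only as a stand-in HMM implementation; the chromosome dictionaries, the validation masks and the rule for picking the "positive" state are toy placeholders, not the actual Spectacle pipeline):

```python
import numpy as np
from hmmlearn import hmm   # stand-in HMM implementation for illustration

def loo_chromosome_cv(chrom_obs, chrom_validation, n_states):
    """Leave-one-chromosome-out cross-validation sketch: train on all chromosomes
    but one, decode states on the held-out chromosome by posterior decoding, and
    score the decoded states against an external validation mask (e.g. active TSSs)
    by precision and recall."""
    results = {}
    for held_out in chrom_obs:
        train = [c for c in chrom_obs if c != held_out]
        X = np.concatenate([chrom_obs[c] for c in train]).reshape(-1, 1)
        lengths = [len(chrom_obs[c]) for c in train]
        model = hmm.CategoricalHMM(n_components=n_states, n_iter=20, random_state=0)
        model.fit(X, lengths)
        posteriors = model.predict_proba(chrom_obs[held_out].reshape(-1, 1))
        states = posteriors.argmax(axis=1)          # posterior decoding
        truth = chrom_validation[held_out].astype(bool)
        # Toy rule: treat the single most validation-enriched state as "positive".
        pos_state = max(range(n_states), key=lambda s: (truth & (states == s)).sum())
        pred = states == pos_state
        tp = (pred & truth).sum()
        results[held_out] = (tp / max(pred.sum(), 1), tp / max(truth.sum(), 1))
    return results

# Toy usage with synthetic "chromosomes"; the real input would be the binarized
# chromatin marks and the Gencode TSS / distal P300 annotations described above.
rng = np.random.default_rng(5)
chrom_obs = {f"chr{i}": rng.integers(0, 8, size=5_000) for i in (1, 2, 3)}
chrom_validation = {c: rng.random(5_000) < 0.02 for c in chrom_obs}
print(loo_chromosome_cv(chrom_obs, chrom_validation, n_states=5))
```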

Table S1 Comparison of the log-likelihood of Spectacle with five EM iterations and ChromHMM run to convergence, varying the number of chromatin states. For each number of chromatin states, the method with the higher log-likelihood appears in bold. The value in parentheses is the log-likelihood ratio between ChromHMM and Spectacle when both methods are run with only five EM iterations. This is for the GM12878 cell type with Scripture peak calls.

Table S10 Precision and recall for predicting external validation datasets. Validation datasets are active Gencode TSSs or distal P300 sites for three cell types. "Whole Genome" denotes that parameters were trained on the whole genome. CV (cross-validation) denotes that parameters were trained on all of the genome except for one chromosome and chromatin states for the held-out chromosome were predicted using the trained parameters; in this way, chromatin states for the whole genome were predicted.

Figure S1 Growth in the number of combinations of marks (Y axis) as a function of the number of marks (X axis). Each red mark, its error bar, and the blue mark at x marks give the average, standard deviation, and maximum, respectively, of the number of combinations of marks when choosing x out of the 16 marks. The black and green lines are polynomials of degree 1 and 2, respectively, fit to the blue marks.