Uniformly processed data reduces noise when learning conditional dependence
To ensure comparable signals across all ChIPseq data sets, we reprocessed raw ENCODE sequence data with a uniform pipeline (Fig. 2
a). We downloaded raw FASTQ files from the ENCODE Data Coordination Center [11, 15, 51] (Additional file 2) and mapped them using Bowtie2 [31] to the human genome reference assembly (build GRCh38/hg38) [19]. We binned mapped read start sites into 1000–basepair (bp) bins across the entire genome, which results in a 3,209,287×1451 data matrix X where genomic positions are viewed as samples (Fig. 2
a). We compared several different data preprocessing methods and chose binned read counts for three reasons:

1.
They allow easy integration of external ChIPseq experiments.

2.
They do not require the determination of various cutoffs in a peak calling algorithm.

3.
A network inferred from read counts performs well, revealing previously known protein–protein interactions (Additional file 1: Figure S1).
A conditionaldependence network can be efficiently learned from binned read count data
Learning a conditionaldependence network among thousands of ChIPseq data sets each containing millions of samples (genomic positions) requires an efficient algorithm (Fig. 2
b). It is well known that the nonzero pattern of the inverse covariance matrix of Gaussian random variables represents the conditionaldependence network [33, 37]. The inverse correlation matrix, Σ
^{−1}, is a normalized version of the inverse covariance matrix and also represents conditional dependence. A zero element ({Σ
^{−1}}_{
ij
}=0) means that the ith and jth variables are conditionally independent of each other given all other variables—they are not connected by an edge.
While it is common practice to learn the conditionaldependence network among continuousvalued variables based on the estimation of Σ
^{−1} [23], count data requires more care. Distributions of counts in binned ChIPseq reads are often clearly truncated at zero, and also increase in variance for high read counts. Multivariate distributions with countvalued marginal distributions are often very restrictive (for example only allowing positive correlations) or are infeasible to estimate for thousands of dimensions [62]. An often employed alternative is to use a multivariate Gaussian distribution after appropriately transforming the count data, such as with the sqrt or asinh function [9]. However, interestingly, our results show that applying a linear Gaussian model directly to the binned read counts of ENCODE ChIPseq data better recovers known protein–protein interactions than when using standard normalizing data transforms (“Methods” and Additional file 1: Figures S1 and S2). This leads to an efficient and simple model formulation for ChromNet applied directly to the mapped read counts, which is relatively easier to obtain compared to other ChIPseq data preprocessing methods and does not require any threshold.
ChromNet first computes the inverse sample correlation matrix
\(\hat \Sigma ^{1}\) from the data matrix X of 1451 variables and 3,209,287 samples, and then uses a GroupGM approach to interpret elements of \(\hat \Sigma ^{1}\) as weights of network edges (Fig. 2
b).
Group modeling mitigates the effects of redundancy
Many ENCODE ChIPseq data sets contain redundant positional information. Conventional conditionaldependence methods have a key limitation in modeling redundant data. If data sets A and A
^{′} are highly correlated, a conventional method would connect A with A
^{′} but connect A to the rest of the network only weakly (Fig. 1
c). Arbitrarily removing or merging redundant data sets can hide or eliminate important information in the data.
GroupGM overcomes challenges with redundant data in conditionaldependence models by allowing edges that connect groups of data sets (such as [ A,A
^{′}] and [ B,B
^{′}]). A group edge weight represents the total dependence between the variables in the two groups that the edge connects, and is computed from Σ
^{−1} as (“Methods”):
$$ G_{[A,A'][B,B']}=\Sigma_{AB}^{1}+\Sigma_{AB'}^{1}+\Sigma_{A'B}^{1}+\Sigma_{A'B'}^{1} \textrm{.} $$
An edge in a GroupGM model implies conditional dependence between the linked groups, but does not specify the involvement of individual factors in each group. We prove that GroupGM correctly reveals conditional dependencies in the presence of redundancy (Additional file 1: Supplementary Note 2).
A group is defined as a set of highly correlated variables whose individual conditionaldependence relationships with other variables are not likely to be captured, as illustrated in Fig. 1
c. To obtain groups, we used complete linkage hierarchical clustering, and restricted groups to have a minimum pairwise correlation of ρ (=0.8) within each group. The choice of completelinkage clustering allows us to obtain groups where all the factors are highly correlated. Because the completelinkage distance metric merges two clusters based on the minimum correlation between any two variables in the groups, we can stop merging when the minimum correlation becomes less than or equal to ρ before creating all 2p−1 groups, where p=1451.
Each variable (a ChIPseq data set) can be in multiple groups as long as it is highly correlated with at least one other data set. This multiscale nature of groups is a unique feature of the group graphical model. It allows us to capture multiple ways each factor can be connected with other factors. Say that a data set for factor A forms a group with another data set for factor B. In the group graphical model, A can have connections specific to itself and connections shared with B, and their edge weight values would indicate which connections are statistically robust. This allows us to reveal multiple kinds of interactions A can have: specifically with itself and with A and B as a complex. The latter may not be captured by a conventional conditionaldependence network, such as inverse correlation or partial correlation, if A and B are highly correlated with one another.
The purpose of having a threshold for minimum pairwise correlation ρ is to identify sets of variables whose high withingroup correlation is likely to prevent them from being connected to other variables in the network. The threshold used in this paper ρ=0.8 captures 53 % of all the multifactor groups formed by hierarchical clustering, and was chosen so as to include strong groups while still keeping the size of groups small enough to interpret (Additional file 1: Figure S3).
Conditional dependence and joint group modeling improve the recovery of known protein–protein interactions
To evaluate how conditional dependence and group modeling both contribute to the performance of ChromNet, we estimated three networks among ChIPseq data sets using the following three methods, where each method produces a set of weighted edges:

1.
Correlation: We learned a naive cooccurrence network, using a pairwise Pearson’s correlation between all pairs of data sets.

2.
Inverse correlation: We learned a conditionaldependence network, by computing the matrix inverse of the correlation matrix.

3.
GroupGM: We learned a group conditionaldependence network, which addresses tight correlation among data sets by allowing edges between groups of variables.
Partial correlation is similar to inverse correlation and performs nearly as well (Additional file 1: Figure S4). We did not include other previously described methods because they do not scale to the large data collection we used (Additional file 1: Supplementary Note 1 and Figure S5).
To assess the quality of the estimated networks, we identified the edges corresponding to published protein–protein interactions. As ground truth, we used the BioGRID database’s assessment of physical interactions between human proteins from experiments deemed low throughput [56]. For evaluation, we excluded edges connecting the same regulatory factor even when measured in different labs, cell types, or treatment conditions. These edges were excluded from evaluation to prevent them from artificially inflating the accuracy of the methods. We also excluded edges involving a histone mark because they do not exist in BioGRID. For these edges, we ran a separate evaluation using the HIstome database [26] and showed that the group graphical model shows higher enrichment than the alternative methods (Additional file 1: Figure S6). When we measured the conditional dependence between a pair of ChIPseq data sets in GroupGM, to avoid the inclusion of many redundant edges, for each pair of data sets, we picked the maximum edge weight out of all network edges connecting groups, each of which contains one of the corresponding data sets. This way, we consider exactly the same number of data set pairs for evaluation across all three methods. We only scored edges from groups containing a single type of factor (about half of the groups; see Additional file 1: Figure S3), because if a group contains more than one factor, there is no clear way to characterize such an edge as true or false from BioGRID, or match it with an edge from competing methods for comparison.
Group modeling improves the recovery of interactions within and between cell types
We compared the performance of the three methods described above across a range of prediction thresholds. For each network, we varied the number of evaluated edges N from 1 to the total number of edges. For each value of N, we identified the set of N edges with the largest weights. We also randomly picked N edges without regard to weight rank as a background set. We then calculated how many edges in each set matched known protein–protein interactions from BioGRID. We computed fold enrichment by dividing the number of matched edges in the prediction set by the expectation of the number in the background set. Since 8.4 % of data set pairs in the same cell type are supported by a BioGRID physical interaction, an enrichment fold of 1 corresponds to 8.4 % of recovered edges matching prior knowledge. Enrichment fold captures the effect of both type I and type II error rates (see “Methods”).
We first measured performance within all cell types, excluding edges between data sets in different cell types (Fig. 3
a top). Since the limited number of annotations in BioGRID imperfectly represents the human chromatin network, one cannot draw strong conclusions about absolute performance from this benchmark. The relative performance of the methods, however, is clear; inverse correlation performs better than correlation, and GroupGM outperforms inverse correlation. This supports the idea that better resolution of direct versus indirect interactions contributes to improved performance of inverse correlation over correlation, while greater robustness against redundancy likely contributes to improved performance of GroupGM over inverse correlation. The value of conditional dependence and group modeling is also further supported by specific examples in the network (Fig. 4; Additional file 1: Figures S7 and S8), and by the fact that GroupGM still outperforms inverse correlation even after attempting to remove the strongest redundancies by merging data sets from different labs targeting the same factor in the same cell type/condition (Additional file 1: Figure S9).
To assess the variability of the enrichment estimate, we performed bootstrap resampling of regulatory factor targets (Fig. 3
a, b, light curves). All data sets with the same factor are sampled together, leading to a conservative (high) estimated variability (“Methods”). GroupGM showed a statistically significant improvement over both correlation (P=0.0004) and inverse correlation (P=0.0036) for edges within cell types (Additional file 1: Figure S10).
To assess variability over cell types, we estimated enrichment separately for each cell type with 25 or more BioGRIDsupported edges. In each cell type, we identified the number N of BioGRIDsupported edges in that cell type. Then, we calculated the enrichment for BioGRIDsupported edges among the top N edges in that cell type (Fig. 3
c). GroupGM performed consistently better than correlation or inverse correlation in individual cell types (Additional file 1: Figure S11).
We also generated a simulated data set meant to mirror the characteristics of real ChIPseq data sets (Additional file 1: Figure S12). Using this simulated data, we found a similar relative performance of various methods, with GroupGM recovering the most true network edges (Additional file 1: Figure S13 and “Methods”).
To assess how well a joint model can recover relationships between factors measured in different cell types, we checked edges between different cell types for enrichment in known protein–protein interactions (Fig. 3
a bottom). The GroupGM network showed a clear enrichment for known interactions above random (P=0.0095), and also outperformed inverse correlation (P=0.0174) and correlation (P=0.0282) (Additional file 1: Figure S14 and “Methods”). This implies that information about many physical protein interactions can be recovered even from data sets in different cell types.
Comparison between a joint model of all cell types and celltype specific models
Integrating ChIPseq data sets from multiple cell types into a single network model provides the following three advantages. First, we can capture highlevel patterns in the joint chromatin network that would not otherwise be visible. Second, a joint model allows the direct comparison of celltype specific subnetworks because factors are conditioned on the same global set of ChIPseq data sets across all cell types. Finally, a data set for a regulatory factor in one cell type can serve as a proxy for a missing data set for that factor in another cell type, if the factor’s localization in the genome is conserved between the cell types (Additional file 1: Figure S15). This greatly expands potential chromatin network edges to include the union of regulatory factors measured in any cell type. This global network contains both conserved and celltype specific subnetworks, and proves useful in analyzing data from ENCODE, which only measures a few factors in some cell types.
To compare directly a joint model across all cell types with celltype specific models for each cell type separately, we focused on the four best characterized ENCODE cell types and compared enrichment of BioGRIDsupported edges (Fig. 3
b). By varying the number of edges in the networks, we find that the joint model consistently identifies interactions with higher fold enrichment for known interactions. In addition, a joint model also identifies more unique BioGRID supported protein–protein interactions than celltype specific models (Additional file 1: Figure S16).
We show as well that the large increase in potential edges from a joint model does not introduce spurious associations among edges within a cell type. When we excluded all crosscelltype edges from the joint model, the joint model still marginally outperforms celltype specific models (P=0.0672; Additional file 1: Figure S17).
An example of the importance of conditional dependence: SMC3 separates RAD21 and MXI1
A specific example illustrates how conditional dependence reveals experimentally supported direct interactions better than pairwise correlation (Fig. 4
a). In the correlation network among RAD21, SMC3, and MXI1, the three factors were tightly connected with one another in HeLaS3 cervical carcinoma cells. The conditionaldependence network, however, separated RAD21 and MXI1. This separation arose from the ability of SMC3 to explain away the correlation between RAD21 and MXI1. The factor pairs left connected in the conditionaldependence network, RAD21–SMC3 and SMC3–MXI1, have physical interactions described in BioGRID [21, 36]. BioGRID lacks any direct connection between RAD21 and MXI1. Panigrahi et al. discovered more than 200 RAD21 interactors using yeast twohybrid screening, immunoprecipitation–coupled mass spectrometry, and affinity pulldown assays [44]. They did not identify a RAD21–MXI1 interaction, which implies that RAD21 may not directly interact with MXI1.
To focus on the comparison between conditional dependence and correlation, we have not displayed the group that contains SMC3 and RAD21. This grouping reflects their common role in the cohesion complex and is present in many cell types. We also note that Fig. 4
a is only a small part of the full ChromNet network and considering more factors reveals additional relationships that involve CTCF and ZNF143, which is consistent with prior knowledge [67] (Additional file 1: Figure S18).
An example of the importance of group dependency: recovering a connection between H3K27me3 and H3K4me3
Another specific example shows how GroupGM mitigates the effect of redundancy on conventional conditionaldependence models. We examined edges between multiple H3K27me3 and H3K4me3 data sets from H7hESC embryonic stem cells, collected at different time points in differentiation [43]. H3K27me3 is a repressive mark and H3K4me3 is an activating mark. Since the data sets represent different portions of the differentiation process, one should not average them or pick a reference data set arbitrarily. However, the H3K27me3 data sets are correlated highly enough with one another to form a group, and so are the four H3K4me3 data sets. This implies that conventional conditionaldependence methods would identify edges between the two histone marks incorrectly.
Edges estimated using correlation indicate that the ChIPseq data sets targeting H3K27me3 and those targeting H3K4me3 are positively correlated. However, H3K27me3 is associated with repressed genomic regions while H3K4me3 is associated with actively transcribed regions [66]. Since a minority of promoters in embryonic stem cells are bivalently marked, these two marks should not have an overall positive association [5, 66]. In fact, most ChIPseq data sets are positively correlated with each other (Additional file 1: Figure S15), which is induced by mappability and many regions that are transcriptionally silent or active. Resolving this problem by removing some of these regions is unlikely to be successful, because it is not clear what criteria we need to exclude regions. In conditionaldependence models, such as inverse correlation and the group graphical model, by conditioning on many other variables, these global confounding effects are naturally removed. Edges estimated by inverse correlation account for these confounders but become weak and unstable showing a mixture of positive and negative associations (Fig. 4
b, middle). By allowing group edges, GroupGM has power to recover the negative association between H3K27me3 and H3K4me3 (Fig. 4
b, right), which is consistent with prior knowledge [5, 66].
An example of learning genomic context: ZNF143 mediates the conditionaldependence relationship between CTCF and SIX5
Many relationships between regulatory factors only occur in a particular genomic context. This raises the question of how, or whether, this context specificity is encoded in ChromNet. We can gain insight into this by considering what it means for one factor to mediate the relationship between two other factors, such as A mediating the relationship between C and D in Fig. 1
a. When this occurs, it means that the connection between C and D can be explained by their cooccurrence with A. In other words, A is the context in which the relationship between C and D occurs.
A practical example of this is found in the relationships between SIX5, ZNF143, and CTCF in the K562 cell type. Simple correlation connects all three factors together with positive edges, but GroupGM shows that ZNF143 actually mediates the relationship between SIX5 and CTCF (Additional file 1: Figures S7 and S8). This means that the association of SIX5 with CTCF primarily occurs in the presence of ZNF143, the CTCF–SIX5 relationship is contextspecific, and ZNF143 is the context. More generally, when an association between two factors, C and D, is specific to a certain genomic context and that context is well represented by a third factor, A, then A would mediate C and D. This gives the connections C–A–D in the conditionaldependence network; thus contextspecific relationships, such as the relationship between CTCF and SIX5 in the presence of ZNF143, are captured in a GroupGM network, if all three factors are present.
It is important to understand the genomic context in which any given edge occurs regardless of whether that context is well represented by another factor in the network. Even if A is not observed, we want to be able to infer the genomic context of the interaction between C and D. To address this need, we designed an efficient method to label every genomic position with its influence on a group network edge (“Methods”). Using CTCF–ZNF143–SIX5 as an example, we removed all ZNF143 experiments from ChromNet and then computed the genomic context of the edge between CTCF and SIX5. To validate this genomic context, we took the top 1000 bins (1,000,000 bp) and intersected them with the top 1000 bins from all other experiments in K562, including ZNF143. Even though ZNF143 was not present in the model and ZNF143 data sets were not used when inferring the genomic context, it had the highest overlap of any experiment with the context driving the CTCF–SIX5 edge, even higher than the CTCF and SIX5 experiments themselves (Additional file 1: Figure S19).
An example of network accuracy: recovered interactions with EZH2 in H1hESC recapitulate known functions
As an example illustrating the utility of ChromNet in revealing the potential interactors of a specific regulatory factor, we examined a small portion of the network associated with the wellcharacterized protein EZH2 (Fig. 4
c). We focused on the H1hESC cell type because it had many strong EZH2 connections in ChromNet. Examining connections to EZH2 in H1hESC highlighted several known interactions, which we discuss in decreasing order of edge strength. The strongest connection is from H3K27me3, and EZH2 is a methyltransferase involved in H3K27me3 maintenance [1]. The next strongest connections are with SUZ12, which is an essential part of the Polycomb repressive complex 2 (PRC2), and is required for EZH2’s methyltransferase activity [8, 13]. The next connection to CTBP2 is supported by this corepressor’s possible role in deacetylation of H3K27 in preparation for PRC2mediated methylation [28]. H3K4me3 is well known to be present in active regions of the genome, so a negative relationship with EZH2 (represented by a dashed line) that deposits the repressive H3K27me3 mark is expected. SP1 is a potentially novel interactor of EZH2, while TCF12 is known to coimmunoprecipitate with EZH2, which suggests that TCF12 interacts with PRC2 [35]. In summary, most of the strongest interactions with EZH2 have support in the literature. We found this mixture of interactions supported by the literature and potential novel connections in many parts of the network.
An example of crosscelltype comparison: enhancerassociated regulatory factors
Learning a conditionaldependence network for all ENCODE cell types allows the comparison of within cell type connections across different cell types. Active enhancers are known to be flanked by a combination of the histone marks, H3K27ac and H3K4me1 [53]. To quantify how strongly different transcription factors associate with active enhancers in different cell types, we calculated the sum of the group edges between each regulatory factor (except histone marks) and H3K27ac and H3K4me1 measured in that cell type. This provides a score for each factor in each cell type. Seven ENCODE cell types with 20 or more data sets contain both H3K27ac and H3K4me1, while also containing EP300, which is known to bind active enhancers [53]. We focused on these seven cell types and ranked the factors in each cell type by their association with H3K27ac and H3K4me1. Additional file 1: Table S2 lists the top ten factors in each cell type most associated with active enhancers. EP300 can be considered a validation for the list and is highly ranked in all seven cell types (P<10^{−5}). Interestingly, even more highly ranked than EP300 is POLR2A. This association is likely because active enhancers are in close proximity to active transcription start sites in promoters in 3D space, due to the looping mechanisms for enhancer–promoter communication. The influence that 3D conformation can have on measures of colocalization in the genome is important to bear in mind when analyzing the edges in ChromNet. Other factors that are consistently associated with enhancers across cell types are shown in red, while celltype specific associations are in black (Additional file 1: Table S2).
An example of a novel protein interaction: experimental validation of an interaction between MYC and HCFC1
The cMYC (MYC) transcription factor is frequently deregulated in a large number and wide variety of cancers [38, 47]. It heterodimerizes with its partner protein MAX to bind an estimated 10–15 % of the genome to regulate the gene expression programs of many biological processes, including cell growth, cell cycle progression, and oncogenesis [6, 38, 47]. The mechanisms by which MYC regulates these specific biological and oncogenic outcomes are not well understood. Interactions with additional coregulators are thought to modulate MYC’s binding specificity and transcriptional activity [22, 58]; however, only a few MYC interactors have been evaluated on a genomewide level. Analysis of the large number of ENCODE ChIPseq data sets can therefore further elucidate MYC interactions at the chromatin level.
ChromNet showed that MAX is the strongest interactor of MYC across multiple cell types (Additional file 1: Table S3), highlighting the ubiquitous nature of this interaction. Topscoring ChromNet connections also included other known MYC interactors, for example, components of the RNA polymerase II complex such as POLR2A and chromatinmodifying proteins such as EP300 (Additional file 1: Table S3). This shows how ChromNet can help identify protein complexes and interactions.
In addition to the known interactors described above, ChromNet also revealed previously uncharacterized, highscoring interactions, including the transcriptional regulator Host Cell Factor C1 (HCFC1) (Additional file 1: Table S3). HCFC1 binds largely to active promoters [39] and is involved in biological processes, such as cell cycle progression [46, 50] and oncogenesis [12, 45, 48]. This further supports its possible role as an interactor of MYC in regulating these activities. To validate the novel MYC–HCFC1 interaction, we performed a proximity ligation assay (PLA) in MCF10A mammary epithelial cells. This technique detects endogenous protein–protein interactions in intact cells [54] and has been used to validate novel interactors of MYC [18]. When two proteins that are probed with specific antibodies are within close proximity of each other, fluorescence signals are produced that are measured and quantified using fluorescence microscopy. We saw only background fluorescence when incubating with an antibody against MYC (Fig. 5
a, top) or HCFC1 (Fig. 5
a, middle) alone. Incubation with both MYC and HCFC1 antibodies yielded a significant increase in the fluorescence signal in the nuclear compartment (Fig. 5
a, bottom, Fig. 5
b and Additional file 1: Figure S20). This suggests that MYC and HCFC1 interact in the nucleus, and HCFC1 may be a novel coregulator of MYC. Future investigation will reveal the importance of HCFC1 in regulating the biological functions of MYC, such as cell cycle progression and oncogenesis. This discovery illustrates how ChromNet can suggest novel protein–protein interactions within chromatin complexes.
Spatial embedding reveals global patterns in the human chromatin network
By integrating all ENCODE data sets from many cell types into a single network, ChromNet enables extraction of global patterns in the relationships among regulatory factors. We used multidimensional scaling [7] to embed the entire network into a 2D layout (Fig. 6 and “Methods”). In this embedding, the spatial proximity of two nodes is designed to reflect their distance in the network, where positive edges pull nodes closer together and negative edges push them father apart. Nodes for the same regulatory factor in different cell types form a cluster when that factor’s genomic position is conserved across cell types. For example, CTCF forms a clear cluster in this manner (Fig. 6
a). Relationships between regulatory factors are represented by their proximity in the embedding. For example, MYC and MAX nodes are located in the same region; so are CTCF and RAD21. In contrast to the joint network, relationships in individual celltypespecific networks (Fig. 1
b top) are much less distinct (Additional file 1: Figure S21).
The relative positions of regulatory factors in the embedded graph highlight important aspects of biology. This is especially apparent among histone marks, where there is a clear separation between activating marks such as H3K4me3 and H3K27ac on the lower right and repressive marks such as H3K27me3 and H3K9me3 on the upper left (Fig. 6
a). H3K27me3 and H3K9me3 are both repressive marks, but form distinct clusters because they target distinct regions of the genome. H3K27me3 marks facultative heterochromatin, thought to regulate temporary repression of generich regions [27]. H3K9me3 marks constitutive heterochromatin, and acts as a more permanent repressor [30]. Between the active and repressive marks, we find H3K36me3 and H3K79me2. H3K36me3 is closer to the inactive marks and is implicated in restricting the spread of H3K27me3 [63]. H3K79me2 varies with the cell cycle and is associated with replication initiation sites [17]. The relative position of histones and protein factors is also interesting. ZNF274 has been implicated in the recruitment of methyltransferases for H3K9me3 and is found nearby in the network [16]. EZH2 is involved in the deposition of H3K27me3 and is found between the H3K27me3 cluster and the rest of the network [61].
The positions of regulatory factor data sets reflect both their celltype identities and association with chromatin states. Highlighting the three tier 1 ENCODE cell types shows a weak clustering of regulatory factor data sets by cell type (Fig. 6
b). K562 and GM12878 are both derived from blood cell lines and overlap spatially with one another in the network more than with the H1hESC human embryonic stem cells. Coloring the network by correlation with chromatin state also reveals spatial patterns. We chose five (out of seven) Segway [24, 64] annotation labels that highlight distinct areas of the network (Fig. 6
c), illustrating a clear separation between active and inactive regions of the genome, and that chromatin domains are reflected in the interactions of the chromatin network. Spatially embedding regulatory factor data sets using the ChromNet network simultaneously captures many important aspects of their function, such as chromatin state, cell lineage, and known factor–factor interactions.