A multi-task graph-clustering approach for chromosome conformation capture data sets identifies conserved modules of chromosomal interactions

Chromosome conformation capture methods are being increasingly used to study three-dimensional genome architecture in multiple cell types and species. An important challenge is to examine changes in three-dimensional architecture across cell types and species. We present Arboretum-Hi-C, a multi-task spectral clustering method, to identify common and context-specific aspects of genome architecture. Compared to standard clustering, Arboretum-Hi-C produced more biologically consistent patterns of conservation. Most clusters are conserved and enriched for either high- or low-activity genomic signals. Most genomic regions diverge between clusters with similar chromatin state except for a few that are associated with lamina-associated domains and open chromatin. Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-0962-8) contains supplementary material, which is available to authorized users.


plementary Methods


Comparing cluster assignments at different bin sizes
and bin definitions

Given two sets of cluster assignments, C 1 and C 2 of regions of potentially different lengths, our goal is to calculate the overlap of the two sets.Let C 1 i denote the i th cluster of the first set of assignments and let C 2 j denote the j th cluster from the second set of assignments.Note that there is no correspondence between C 1 i and C 2 i .For every pair of clusters C 1 i and C 2 j define overlap between j overlap(r, r ),
where r and r are regions and overlap(r, r ) is the number of base pairs that overlap bet een two regions.

We map each C 1 i to a cluster C 2 i * where i * = arg max j overlap(C 1 i , C 2 j ), is the cluster in the second set with the largest overlap.We iterate over all clusters in the first cluster assignment and sum over the length of the best overlaps (o 12 = i overlap(C 1 i , C 2 i * )).The overlap ratio is defined as o 12 l 1 where o 12 is the length of the overlap (base pair) and l 1 is the length of the genome covered by the first cluster a signment (bp).

We calculate this ratio from C 1 to C 2 and vice versa.The similarity between two cluster assignments is then defined as the average of the two ratios.To test the significance of the computed overlap, we compared it to the overlap between a clustering to a random shuffling of itself.The overlap with the random clusters is dependent on the number of clusters and size of the clusters, but in all cases, overlap estimated by a random clustering is significantly lower than the estimated overlap between two estimated clusterings (Supplementary Table 1).For example the overlap observed between trans and cis and trans interactions of human ESC at 1Mbp resolution is 59%, while the overlap between the cis and trans clusters and a random shuffling of itself is 24±0.5%.To calculate the F-score between a pair of clusters C 1 i and C 2 j (shown in Supplementary Figures S3, S7, S8, S12, S13), we calculate the overlap of the two clusters as described before
(o = overlap(C 1 i , C2

Clustering TAD regions

We start by aggregating normalized contact counts at 40Kbp resolution

n pairs of TAD regions.
f bin i is inside T AD k and bin j is inside T AD l , we add the normalized contact count of the pair (bin i , bin j ) to pair of TAD regions (T AD k , T AD l ).Because the length of TADs are variable (160Kb-5Mb), we normalized the aggregated sums by dividing by the size of the TADs; if the length T AD k is l k and T AD l is l l , we divide the aggregated sum of (T AD k , T AD l ) by l k × l l .We next clustered the matrix of the aggregated sums using spectral clustering (with Spearman correlation as the graph edge weights).Figure S7 shows the resulting clusters.Figure S7 a-e shows the heatmap of Spearman correlation matrix of the aggregated sums, re-ordered by the spectral clusters, the distribution of chromosomes in clusters, patterns of enrichment of different regulatory signals, and the significance of the enrichment in the clusters (similar to Figure 2).We next compared the clustering of TADs to the clustering of fixed size bins at 1Mbp resolution and calculated their similarity (as described above).Clustering using TADs and 1Mbp bins have a 43% overlap, which is significantly higher than random.Furthermore, for 4 of the 10 clusters we can uniquely map the clusters from TADs to clusters using 1Mbp bins Figure S7f.


Clustering inter-chromosomal (trans) interactions only

To create the adjacency matrix, A using inter

hromosomal interactions only, we calculated the Spearma
correlation of interaction profile of regions, after excluding the intra-chromosomal (cis) interactions.For a pair of regions r 1 and r 2 , where r 1 is in chromosome i and r 2 is in chromosome j, we define their trans correlation, corr trans (r 1 , r 2 ), as the Spearman correlation between two vectors x r 1 and x r 2 .Each entry

x r 1 (t) corresponds to all regions t that are not in chromosomes i or j.Each entry of the adjacency ma rix is then defined as
A r 1 ,r 2 =                0 if chr(r 1 ) = chr(r 2 ) 0 if corr trans (r 1 , r 2 ) < 0 cor region r.We then perform spectral clustering on A as described before.Supplementary Figure S3a,b,c shows the heatmap of the adjacency matrix, and the enrichment of different regulatory signals in these clusters for human ES cell line.We find that these clusters exhibit similar patterns of enrichment as the clusters produced when using both inter and intrachromosomal interactions (Figure 2).Supplementary Figure S3d, e shows the distribution of chromosome in clusters, and the distribution of clusters in chromosomes.6 out of 10 clusters consist of regions from different chromosomes, 3 of these 6 are enriched with different regulatory signals associated with open chromatin, and 3 are enriched for LADs and are associated with closed chromatin state.We also observe that all chromosomes are split into more than 1 cluster.Supplementary Figure S3f shows the overlap of these clusters to clusters obtained using both inter and intra chromosomal interactions.We observe significant overlap (59%) between the two clusterings, suggesting that at this resolution, we can recover similar structures from both trans and cis and trans interactions.


Statistical significance of cluster evaluation values to assess over-clustering

Because several clusters exhibited

imilar patterns of enrichment that could be grouped into high and low activity r
gions, we tested whether k = 10 was overclustering the data, by computing the expected value of the evaluation criteria in a random setting.Specifically, to evaluate whether the scores that we observe from different evaluation metrics are significant, we compare these scores to random cluster assignments (Supplementary Figure S5).We generated 100 random cluster assignments by shuffling the cluster assignments from spectral (Spearman> 0) clustering.Davies-Bouldin index, Delta-contact counts, and silhouette index for the actual clustering result is significantly better (red star in Supplementary Figure S5) than random cluster assignments.As an additional sanity check of our clustering, we compared the distribution of interaction strengths within and between clusters for every pair of clusters (Supplementary Figure S5b,   c).Supplementary Figure S5b shows the distribution of positive Spearman correlation within regions of clusters C0 and C1, and between regions in clusters C0 and C1.We observe that regions in each cluster are significantly more correlated to themselves compared to regions between two clusters.Supplementary Figure S5c shows the the distribution of correlation between and within clusters, for all pairs of clusters.We observe significant difference in between and within correlations for all pairs of clusters (KS-test p-value < 1E − 20).


Assessing the dependencies between different regulatory features

The regulatory genomic signals that we used for assessi

ichment in the clusters were highly correlated (Supplementary Fig
re S4a, b).Because previously open chromatin assays measured through DNase I footprinting has been a strong predictor of interacting and non-interacting regions, it is possible that the enrichment patterns that we observed were a consequence of their association of DNase I, and not due to the clustering of the Hi-C data.To test whether the clustering is providing additional information over DNase I for the other non-DNase I features we computed the conditional mutual information between two random variables, given the DNase I signal.One random variable represented the cluster partition and the second random variable represented one of the non-DNase I signals.Before calculating the conditional mutual information, we first discretize each feature by clustering it into 3 clusters (using Matlab's kmeans).To assess the significance of the conditional mutual information for a feature and a clustering, we performed a permutation test.We shuffled the cluster assignments for 1000 times and calculated mean and STD of the mutual information values for the 1000 random cluster assignments.We calculated the significance of the conditional mutual information by comparing the actual value to the mean and STD from random permutations (using Matlab's ztest).Supplementary Figure S4c-g shows the conditional mutual information of different features and the results of different spectral clustering methods, given the DNaseI feature vector, and the corresponding random value.Even though different regulatory features are significantly correlated, we observe that the enrichment of these features in the resulting clusters can not be explained using DNase I alone and the clusters are providing additional information.


Arboretum-HiC algorithm details

Given the Hi-C contact count matrices of M different cell line/species, and a mapping between the reg

ns in these cell lines and/or sp
cies, the goal of Arboretum-Hi-C is to cluster the contact counts.Between cell lines of the same species, the regions are identical.Between two species, we define the mapping using orthology of the genomic bins (as described in the manuscript methods).The original Arboretum algorithm was developed to cluster multiple expression clusters, one for each species.Below we describe the key differences in the initialization and learning algorithm due to the nature of the Hi-C data.


Initialization

To cluster Hi-C contact count matrices using the spectral clustering algorithm, we computed the top K eigenvectors of the

aph Laplacian a
d apply Gaussian mixture model on each set of eigenvectors.Here K denotes the number of clusters in each dataset.Let X i denote the set n × k matrix of the K largest eigenvectors where each column corresponds to an eigenvector.We merge this matrix into
X = [X 1 , • • • , X M ].
We next applied the k-means algorithm to this merged matrix with 50 random restarts and pick the clustering with the best objective.These cluster assignments are projected onto each Hi-C dataset to estimate the initial means and diagonal covariance elements of each Gaussian mixture.There is one mixture per leaf node.


Learning the Arboretum-Hi-C model

The parameters of Arboretum-Hi-C are the transition matrices for each branch in the tree, the prior probabil

y of each cluster, and the Gaussia
mixture model associated with each leaf node which has the measured dataset.As in Arboretum, we use an Expectation Maximization (EM) framework to cluster the data: in the E step the hidden cluster assignments are inferred, and in the M step the Gaussian mixture model parameters and the transition parameters are estimated.The EM algorithm is very similar to the original Arboretum algorithm, with the key difference that we are applying it to eigenvectors rather than expression values.

Expectation: Let z k ri denote the indicator variable that region i in the last common ancestor, r (the root of the phylogenetic tree) is in clust r k.Let z k|k ti denote the indicator variable that region i is assigned to module k in node t, while in its immediate ancestor it was assigned to k .Let γ k|k ti be the posterior probability of this event.Let P t (k|k ) denote the transition probability of a region to switch to cluster k given that it was in cluster k in t's immediate ancestor.We have a P t (k|k ) for each branch of the tree.Let
{µ t 1 , • • • , µ t k , Σ t 1 , • • • , Σ t k )
} denote the mixture Gaussian parameters when t corresponds to a leaf node.We assume that module assignment of region i in node t is only dependent on the sub-tree rooted at t.Given this independent assumption, we can infer these posterior probabilities in a recursive manner.We start at a leaf node t and estimate the γ variables based on the row, x t i of the eigenvector matrix X t corresponding to region i and the Gaussian mixture parameters with this node:
γ k|k ti = P t (k|k )P (x t i |µ t k , Σ t k ) P (x t i |µ t k , σ t k )
is the probability of observing x t i from a Gaussian with parameters µ t k and Σ t(t) (k)n right(t) (k)
where lef t(t) and right(t) are respectively left child and right child of node t, and
n t (k) = k γ k |k t t(t) (k) and n right(t) (k) sum over all possible cluster assignments of region i in children of t.

Maximization: The maximization step of the inference is very similar to standard Gaussian mixture models.For module k in node t we have:
µ tk = i K k =1 γ k, t x t i i K k =1 γ k,k t where γ k,k t
is the expected value of joint assignment to module k in t and to module k in immediate ancestor o ihood:

The penalized likelihood of the model is defined as the data likelihood (the combination of likelihood of mixture of Gaussians in leaf nodes and ransition probabilities from parent nodes to child nodes) penalized by the number of parameters: L − np 2log(no) where L is the data likelihood, n p is the number of free parameters, and n o is the number of regions.For a simple mixture of Gaussian the number of parameters would be 2KT (K clusters with a mean and diagonal covariance matrix with T elements corresponding to T measurements in the input data).We note that in the spectral clustering framework, K = T and therefore the number of parameters is 2K 2 .In Arboretum additionally we will have K(K − 1) parameters for the transition matrices in leaf nodes and internal nodes, and K − 1 parameters in root node for the initial prior probabilities of the modules.So the number of parameters in the complete model will add to n p = (2K 2 + K(K − 1))S e + (K(K − 1))S a + (K − 1) where S e correspond to the number of leaf nodes (cell line/species) and S a correspond to the number of internal nodes excluding the root node.


Arboretum-Hi-C clusters at higher resolution

We applied Arboretum-Hi-C to both regions of 1Mbp and 500Kbp size.Supplementary Figure S12a,b show the results of

boretum-Hi-C at 500Kbp resolution.Normalized
ontact counts were aggregated at 500Kbp resolution.We defined the orthology map between the 500 bp regions as described before for 1Mbp resolution maps.We observe similar patterns of enrichment as we observed in the 1Mbp resolution clusters.Supplementary Figure S12c-e shows the percentage of overlap between clusters recovered at 500Kbp and 1Mbp resolutions, and we observe significant overlap between the clusters in all cell lines.


Arboretum-Hi-C clusters for inter-chromosomal interactions

We also applied Arboretum-Hi-C to trans-only interactions in a manner similar to the single task spectral

ustering, but excluding any interaction that is in cis in e
ther human or mouse.We considered only those regions that have an ortholog in human and mouse.As before, to define the trans correlation of two regions, r 1 and r 2 , we first exclude all regions, r that are in the same chromosome as r 1 or r 2 or if r 's ortholog is in the same chromosome as r 1 's ortholog or r 2 's ortholog in the other species.We compute the Spearman correlation between these two vectors, corr trans (r 1 , r 2 ).

We then define our adjacency matrix as:
A r 1 ,r 2 =                0 if chr h (r 1 ) = chr h (r 2 ) ∨ chr m (r 1 ) = chr m (r 2 ) 0 if corr trans (r 1 , r 2 ) < 0 corr trans (r 1 , r 2 ) otherwis hylogenetic tree structure on Arboretum-Hi-C

To test the effect of the structure of the phylogenetic tree on the results of Arboretum-Hi- , we tested two alternative tree structures, one where the cell lines from the same species are siblings (tree1) and one where ES cell lines are siblings (tree2, Supplementary Figure S11a).We observed that the likelihood of the model is significantly higher when using tree1 (Supplementary Figure S11b).We furthermore investigated the conservation scores between the pairs of cell lines when using different structures of phylogenetic tree (Supplementary Figure S11c).We found that even though using tree2 structure decreases the within species conservation scores (hESC vs. hIMR90 and mESC vs. mCortex), these scores are still significantly higher than the between species conservation scores, suggesting that for these cell lines, Hi-C data from the same species are more similar to each other than the Hi-C datasets from matched cellular stages.


Assessing the ability of clustering methods to recover TADs

We examined the ability of spectral and other clustering methods to recover TADs.We first examined a small porti

of chromosome 1 of mESC (3Mb-9.5Mbat 40Kbp resolution).Supp
ementary Figure S10a shows the log2 of contact count for these regions, and Supplementary Figure S10b highlights the clusters recovered by different methods.Even though there are some disagreement between the clusters from different methods, we observe significant overlap between

ese clusters and the reported TADs.These results

suggest that c
ustering could be a useful approach in recovering different topological structures.

Supplementary Figure S10c shows a graph representation of these regions.Edges correspond to top 10% of contact counts between the regions, and nodes are coloured to highlight different TADs (as reported by Dixon et al. [1]).We repeated our TAD analysis for all the chromosomes by extracting the contact count matrix corresponding to the reported TADs in each chromosome, and clustering this matrix using the different methods (k = number of TADs in the chromosome).Because using global Spearman correlation can wash out the correlation information associated with the smaller bin size of 40Kbp, here we only used contact counts and log2 of contact counts.Supplementary Figure S10d shows the percentage overlap of different method and the TADs.We observe significant agreement between different clustering methods and the reported TADs in all chromosomes.


Assessing ability of clustering methods to recover compartments

To assess the ability of clustering methods in recovering compartments, we first applied our different clustering method on the chromosome 1 of hESC.To define the compartments, we used the procedure described in Lieberman-Aiden et al. [2].The raw contact counts between regions r 1 and r 2 were normalized by divid-ing it by the average contact count at distance of dist(r 1 , r 2 ).Different clustering methods were applied to this normalized matrix, and compartments were defined by positive and negative portions of the first principal component of the Spearman correlation of the normalized contact count.Supplementary Figure S10e shows the heatmap of the Spearman correlation of the normalized matrix, and Supplementary Figure S10f highlights the clusters recovered by different methods.We observe that most of the clusterings have significant overlap with the compartments, though the methods that use correlations distance (Spearman or Pearson) have better success in recovering the compartments, which is expected since the compartments were defined using the correlation of the normalized contact count matrix.We repeated our clustering for all the chromosomes in the human ESC dataset and compared the overlap of compartments to clusters obtained from different clustering methods.Because clustering of Spearman correlation had the best agreement with the compartments in chromosome 1, here we only compared different clustering methods using Spearman correlation as similarity measure.Supplementary Figure S10g shows the percentage overlap of different method and the compartments.We observe significant agreement between different clustering methods and the recovered compartments in all chromosomes.

2 Supplementary Figures       In the first row of each panel, the first plot on the left is the heatmap of Spearman correlation matrix of contact counts, reordered according to the spectral cluster assignments.The second plot is the enrichment of different regulatory features reordered according to the cluster assignments.The third plot on the right is -log10 of KS-test p-values.The first plot in the second row is the distribution of chromosomes in each cluster and the last plot is the distribution of clusters in each chromosome.In the first row of each panel, the leftmost plot is the heatmap of Spearman correlation matrix of contact counts, reordered according to the spectral cluster assignments.The second plot is the enrichment of different regulatory features reordered according to the cluster assignments.The rightmost plot is the -log10 of KS-test p-values.The first plot in the second row is the distribution of chromosomes in each cluster and the last plot is the distribution of clusters in each chromosome.The exceptions are chromosomes 16, and 19-22, where the whole chromosome was put in one cluster.).There is a significant change in the distribution number of peaks between diverged regions and regions of clusters 1 and 9 for CTCF, RAD21 and CEBPB while the difference in distribution for MAFK and POLR2A is not as significant.


List of Figures


Distribution of chromosomes in clusters Distribution of clusters in chromosomes
chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX

Distribution of chromosomes in clusters Distribution of clusters in chromosomes
chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrXchr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrXC0 C1 C2 C3 C4 C5 C6 C7 C8 C9 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX

Spearman correlation of contact counts