Graph-regularized non-negative matrix factorization (NMF) and Clustering for Hi-C data (GRiNCH) framework
GRiNCH is based on a regularized version of non-negative matrix factorization (NMF) [36] that is applicable to high-dimensional chromosome conformation capture data such as Hi-C (Fig. 1). Below we describe the components of GRiNCH: NMF, graph regularization, and clustering for TAD identification.
Non-negative matrix factorization (NMF) and graph regularization
Non-negative matrix factorization is a popular dimensionality reduction method that aims to decompose a non-negative matrix, \(\textup {X} \in \mathbb {R}^{\left (n \times m\right)}_{\geq 0}\) into two lower dimensional non-negative matrices, \(\textup {U} \in \mathbb {R}^{\left (n \times k\right)}_{\geq 0}\) and \(\textup {V} \in \mathbb {R}^{\left (n \times k\right)}_{\geq 0}\), such that the product X∗=UVT, well approximates the original X. We refer to the U and V matrices as factors. Here k<<n,m is the rank of the factors and is user-specified.
In application of NMF to Hi-C data, we represent the Hi-C data for each chromosome as a symmetric matrix \(\textup {X} = \left [x_{ij}\right ] \in \mathbb {R}^{\left (n \times n\right)}\) where xij represents the contact count between region i and region j. We note that in the case of a symmetric matrix, U and V are the same or related by a scaling constant.
The goal of NMF is to minimize the following objective: \(\left \|\textup {X} - \textup {U}\textup {V}^{\top }\right \|^{2}_{\textup {F}}, \text {s.t.} \textup {U} \geq 0, \textup {V} \geq 0\) [33], where ||X||F indicates the Frobenius norm. A number of algorithms to optimize this objective have been proposed; here we used the multiplicative update algorithm, where the entries of U and V are updated in an alternating manner in each iteration:
$$ u_{ik} \leftarrow u_{ik} \frac{\left(\textup{X}\textup{V} \right)_{ik}}{\left(\textup{U}\textup{V}^{\top}\textup{V}\right)_{ik}},\ v_{jk} \leftarrow v_{jk} \frac{\left(\textup{X}^{\top}\textup{U}\right)_{jk}}{\left(\textup{V}\textup{U}^{\top}\textup{U}\right)_{jk}} $$
(1)
Here uik corresponds to the ith row of column U(:,k) and vjk corresponds to the jth row of column V(:,k).
Standard application of NMF to Hi-C data is ignorant of the strong distance dependence of the count matrix, that is, genomic regions that are close to each other tend to interact more with each other. To address this issue, we apply a constrained version of NMF with graph regularization, where the graph represents additional constraints on the row (and/or column) entities [36]. Graph regularization enables the learned columns of U and V to be smooth over the input graph. In our application of NMF to Hi-C data, we define a graph composed of genomic regions as nodes, with edges connecting neighboring regions in the linear chromosome, where the size of the neighborhood is an input parameter. Specifically, we define a symmetric nearest-neighbor graph, W:
$$ \textup{W}_{ij} = \left\{ \begin{array}{l} 1 \textup{, if } x_{i} \in N_{r}(x_{j}) \textup{ and } x_{j} \in N_{r}(x_{i}) \\ 0 \textup{, otherwise} \end{array} \right. $$
(2)
where Nr(xi) denotes r nearest neighbors in linear distance to region xi.
Graph regularized NMF has the following objective:
$$ \|\textup{X} - \textup{U}\textup{V}^{\top}\|^{2}_{\textup{F}} + \lambda \textup{Tr}\left(\textup{V}^{\top}\textup{L}\textup{V}\right) + \lambda \textup{Tr}\left(\textup{U}^{\top} \textup{L}\textup{U}\right), $$
(3)
where D is a diagonal matrix whose entries are column (or row, since W is symmetric) sums of W, i.e., \(\textup {D}_{ii} = \sum _{j} \textup {W}_{ij}\). L=D−W denotes the graph Laplacian and encodes the graph topology. The second and third terms are the regularization terms and measure the smoothness of U and V with respect to the graph. Here λ is the regularization hyperparameter. This new objective has the effect of encouraging the factors to be smooth on the local neighborhood defined by the graph. Accordingly, the multiplicative update rule from (1) gains regularization terms [36]:
$$ u_{ik} \leftarrow u_{ik} \frac{\left(\textup{X}\textup{V} + \lambda\textup{W}\textup{U}\right)_{ik}}{\left(\textup{U}\textup{V}^{\top}\textup{V} + \lambda\textup{D}\textup{U}\right)_{ik}},\ v_{jk} \leftarrow v_{jk} \frac{\left(\textup{X}^{\top}\textup{U} + \lambda\textup{W}\textup{V}\right)_{jk}}{\left(\textup{V}\textup{U}^{\top}\textup{U} + \lambda\textup{D}\textup{V}\right)_{jk}} $$
(4)
Both r (neighborhood radius) and λ are parameters that can be specified, with λ setting the strength of regularization (λ=0 makes this equivalent to basic NMF). See section on “Selecting GRiNCH hyper-parameters” below.
Chain-constrained k-medoids for clustering and TAD calling
The factors U (or V) can be used to extract clusters of the row (or column) entities of the input matrix. When X is symmetric, e.g., in our application to Hi-C, either U or V can be used to define the clusters (the factors are equivalent up to a scaling constant). Assuming we use U, there are two common approaches for finding clusters from NMF factors: (1) assign each row entity i to its most dominant factor, i.e., assign it to cluster \(c_{i} = \text {argmax}_{j \in \{ 1, \dots, k \}} u_{ij}\), or (2) apply k-means clustering on the rows of U. However, both approaches fall short in our application. The first approach is sensitive to extreme values which can still be present in the smoother factors, yielding non-informative clusters. Furthermore, neither approach reinforces contiguity of genomic regions in each cluster along their chromosomal position. As a result, a single cluster could potentially contain genomic regions from two opposite ends of the chromosomes instead of being a contiguous local structural unit. To address this problem, we apply chain-constrained k-medoids clustering. k-medoids clustering is similar to k-means clustering, except that the “center” of each cluster is always an actual data point, rather than the mean of the datapoints in the cluster. In its chain-constrained version (Additional File 2, Algorithm 1), adopted from spatially connected k-medoids clustering [70], each cluster grows outwards from initial medoids along the linear chromosomal coordinates. The algorithm assigns a genomic region to a valid medoid region either upstream or downstream along the chromosome, ensuring the contiguity of the clusters and resilience to noise or extreme outliers provided by using a robust ‘median’-like cluster center rather than a ‘mean’-like center used in k-means clustering.
Selecting GRiNCH hyperparameters
GRiNCH has three hyper-parameters: (a) k, the rank of the lower-dimensional matrices, which can alternately be viewed as the number of latent features or clusters; (b) r, the radius of the neighborhood in the graph used for regularization; and (c) λ controlling the strength of regularization.
The parameter k determines the number of latent features to recover and the resulting number of GRiNCH TADs. We can obtain subTAD-, TAD-, or metaTAD-scale clusters (Additional File 1, Figure S14a) by setting k such that the expected size of a cluster is 500kb, 1Mb, or 2Mb, i.e., k equals the given chromosome’s length divided by the expected size. We find that a larger portion of subTAD-scale clusters (i.e., expected TAD size = 500kb) have significant internal validation metric values (Additional File 1, Figure S14b). SubTAD-scale clusters tend to be more stable to depth and sparsity (Additional File 1, Figure S14c) and are also more enriched in boundary elements like CTCF (Additional File 1, Figure S15a). As a tradeoff, higher proportion of metaTAD-scale clusters (i.e., expected cluster size = 2Mb) are enriched in histone modification marks (Additional File 1, Figure S15b). Based on the use case of GRiNCH, k can be set dynamically by the user; by default, GRiNCH sets k such that the expected size of a cluster is 1Mb, or at TAD-scale.
For regularization strength, λ∈{0,1,10,100,100} were considered, with λ=0 equivalent to standard NMF without regularization. For neighborhood radius, r∈{25K, 50K, 100K, 250K, 500K,1M} were considered, where r=100K in a Hi-C dataset of 25-Kb resolution will use 4 bins on either side of a given region as its neighbors. We find that some regularization, with λ=1, yields better CTCF enrichment than other λ values (Additional File 1, Figure S1a). With regularization, a neighborhood radius of 100Kb or larger yields higher CTCF enrichment (Figure S1B). We also note that the regularization parameters do not discernibly change the TAD size distribution (Additional File 1, Figure S16). Based on these results, the default regularization parameters for GRiNCH are set at λ=1 and r=250kb.
Memory consumption and runtime
In graph-regularized NMF, the size of the input matrix n and the reduced dimension k are the main drivers of computational complexity which is O(kn2) [36]. We measured memory consumption (maximum resident set size) and runtime of GRiNCH across five cell lines (GM12878, HMEC, HUVEC, NHEK, K562) with different combinations of input matrix size (determined by chromosome length and Hi-C resolution), expected cluster/TAD size (which determines k for a given matrix), and regularization parameters (λ∈{0,1,10,100,100} and neighborhood radius ∈{ 25kb, 50kb, 100kb, 250kb, 500kb, 1Mb }). These runs were completed across a distributed computing platform with machines of varying computing power. We plot the maximum resident set size and runtime against input matrix size in Figure S17 (Additional File 1). We observe that, in concordance with the computational complexity, time consumption increases in a quadratic fashion with respect to the input matrix size and in a linear fashion to k. Memory consumption increases in a similar manner, i.e., if the input matrix size doubles, the memory requirement approximately quadruples.
Stability and initialization of NMF
The NMF algorithm is commonly initialized with random non-negative values for the entries of U and V. The initial values can significantly impact the final values of U and V [71]. This leads to instability of the final factors hinging on the randomization schemes or changing seeds. To address the instability, we used Non-Negative Double Singular Value Decomposition (NNDSVD), which initializes U and V with a sparse SVD approximation of the input matrix X [72]. Since the derivation of exact singular values can considerably slow down the initialization step, we use a randomized SVD algorithm which derives approximate singular vectors [73]. NNDSVD initialization with randomized SVD results in lower loss, i.e., factors that can better approximate the original Hi-C matrix, in fewer iterations (Additional File 1, Figure S18a,b), and more stable results than direct random initialization (Additional File 1, Figure S18c,d).
Datasets used in experiments and analysis
High-throughput chromosome conformation capture datasets
We applied GRiNCH to interaction count matrices from in situ Hi-C (with MboI as the restriction enzyme) for five cell lines, GM12878, NHEK, HMEC, HUVEC, and K562 at 10-kb, 25-kb, and 50-kb resolution [37] (GEO accession: GSE63525). From the same source, we additionally used GM12878 25kb-resolution data from in situ Hi-C using DpnII as the restriction enzyme, and GM12878 25-kb resolution data from dilution Hi-C using HindIII as the restriction enzyme in our analysis on smoothing.
To demonstrate the applicability of GRiNCH to multiple high-throughput chromosome conformation capture platforms, we applied GRiNCH to datasets from other technologies that capture the 3D genome structure and chromatin interactions: Split-Pool Recognition of Interactions by Tag Extension (SPRITE) [9] and HiChIP [38]. We used the SPRITE data for GM12878 cell line (GEO accession: GSE114242). For HiChIP, we applied GRiNCH to the contact matrices from cohesin HiChIP (GEO accession: GSE80820) [38] and H3k27ac HiChIP (GEO accession: GSE101498) [60].
To demonstrate the utility of GRiNCH to study 3D genome organization in dynamic processes, we applied GRiNCH to two different mouse developmental time course data: (a) neural differentiation Hi-C data from embryonic stem cells (mESC), neural progenitors (NPC), and cortical neurons (CN) (GEO accession: GSE96107) [45] and (b) Hi-C data from reprogramming pre-B cells to induced pluripotent state [46] (GEO accession: GSE96553). For (a) neural differentiation dataset, Juicer Straw tool [74] was used to obtain 25kb Hi-C matrices with vanilla-coverage square-root normalization. For (b) reprogramming, we applied GRiNCH to published normalized Hi-C data from pre-B cells, B α cells, day 2, day 4, day 6, and day 8 of reprogramming, and finally, pluripotent cells.
ChIP-seq, DNase-seq, ATAC-seq, and motif datasets
To interpret the GRiNCH results and for comparison to other methods, we obtained a number of ChIP-seq datasets. For CTCF, ChIP-seq narrow-peak datasets available as ENCODE Uniform TFBS composite track [75] were downloaded from the UCSC genome browser (wgEncodeEH000029, wgEncodeEH000075, wgEncodeEH000054, wgEncodeEH000042, wgEncodeEH000063).
As ChIP-seq data for SMC3 and RAD21 are not available in the five cell lines from Rao et al. [37], we generated a list of cell-line-specific accessible motif sites. Accessible motif sites were defined as the intersection of motif match regions and DNase-accessible regions in the given cell line. The SMC3 and RAD21 motif matches to the human genome (hg19) were obtained from [76]. To create a union of DNase accessible regions from replicates within a cell line, BEDtools [77] merge program was used. Finally, the intersection of DNase accessible regions and motif match regions was calculated for each cell line using BEDtools intersect program. DNase accessibility sites were obtained from the ENCODE consortium [78, 79]: ENCFF856MFN, ENCFF235KUD, ENCFF491BOT, ENCFF946QPV, ENCFF968KGT, ENCFF541JWD, ENCFF978UNU, ENCFF297CKS, and ENCFF569UYX.
We obtained ChIP-seq datasets for histone modification marks from the ENCODE consortium [78, 79]. To generate genome-wide histone modification levels for each mark, fastq reads were aligned to the human genome (hg19) with bowtie2 [80] and aggregated into a base-pair signal coverage profile using SAMtools [81] and BEDtools [77]. The base-pair signal coverage was averaged within each 25-kb bin to match the resolution of Hi-C dataset. The aggregated signal was normalized by sequencing depth within each replicate; the replicates were collapsed into a single value by taking the median.
In order to identify novel transcription factors that could play a role in 3D genome organization, we obtained motifs of 746 different transcription factors from JASPAR core vertebrate collection [53]. Next, we obtained their accessible motif match sites in hg19 for the five cell lines from [37] using the same process that was used for SMC3 and RAD21 motifs. To identify the accessible motif sites for mouse cells during pluripotency reprogramming [46], we aligned ATAC-seq fastq reads to the mouse genome (mm10) with bowtie2 [80] and deduplicated with SAMtools [81]. Accessible peaks were called with MACS2 [82]. The ATAC-seq peaks were then used in place of DNase-seq sites to find the accessible motif sites as was done for SMC3 and RAD21 motifs.
TAD-calling methods
GRiNCH was benchmarked against 7 other TAD-calling methods: directionality index method [23], Armatus [20], insulation score method [25], rGMAP [24], 3DNetMod [22], HiCseg [83], and TopDom [84]. For all methods, default or recommended parameter values were used when available. Execution scripts containing the parameter values used for these methods are available to download (“Availability of data and materials” section).
Directionality index
Directionality index uses a hidden Markov model (HMM) on estimated directionality index (DI) scores. The DI score for a genomic region is determined by whether the region preferentially interacts with upstream or with downstream regions. A bin can take on one of three states (upstream-biased, downstream-biased, or not biased) based on the interaction profile within a fixed-sized (e.g., 2Mb) window up- and downstream of the bin, with directionally biased bins becoming TAD boundaries. TADs were called using the directionality index method implementation in TADtool [85], version as of April 23, 2018.
Armatus
Armatus uses dynamic programming to find subgraphs in a network where the nodes are the genomic regions, and the edge weights are the interaction counts. The objective is to find the set of dense subgraphs; subgraph density is defined as the ratio of the sum of edge weights to the number of nodes within the subgraph. Armatus predicts a set of overlapping TADs, then consolidates them into consensus TADs. The consensus TADs were used in our analysis. Armatus version 2.3 was used for comparison.
Insulation score
In the insulation score method, each bin is assigned an insulation score, calculated as the mean of the interaction counts in the window (of a predefined size) centered on the given bin. Bins corresponding to the local minima in the vector formed by these insulation scores are treated as TAD boundaries. TADtool [85] implementation of insulation score method, version as of April 23, 2018, was used in our experiments.
3DNetMod
3DNetMod employs a Louvain-like algorithm to partition a network of genomic regions into communities where the edge weights in the network are the interaction counts. It uses greedy dynamic programming to maximize modularity, a metric of network structure measuring the density of intra-community edges compared to random distribution of links between nodes. 3DNetMod outputs a set of overlapping TADs. It was excluded from any analysis that required a unique TAD assignment for each genomic region or involved TAD shuffling. Software version 1.0 (10/06/17) was used in our comparison.
rGMAP
rGMAP trains a two-component Gaussian mixture model to group interactions into intra-domain or inter-domain contacts. Putative TAD boundary bins are identified by those with significantly higher intra-domain counts in its upstream window or downstream window of predefined size. The chromosome is then segmented into TADs flanked by these boundaries. rGMAP outputs a set of hierarchical, overlapping domains and a set of non-overlapping TADs; we used the latter in our analysis. Software version as of April 23, 2018, was used for comparison.
HiCseg
HiCseg treats the Hi-C matrix as a 2D image to be segmented, with each block-diagonal segment corresponding to a TAD. The counts within each block are modeled to be drawn from a certain distribution (e.g., Gaussian distribution for normalized Hi-C data). Using dynamic programming, HiCseg finds a set of block boundaries that would maximize the log likelihood of counts in each block being drawn from an estimated distribution. Version 1.1 was used in our experiments.
TopDom
TopDom generates a score for each bin along the chromosome, where the score is the mean interaction count between the given bin and a set of upstream and downstream neighbors (neighborhood size is a user-specified parameter). Putative TAD boundaries are picked from a set of bins whose score forms a local minimum; false-positive boundaries are filtered out with a significance test. Version 0.0.2 was used in our analysis.
TAD evaluation criteria
We evaluated the quality of TADs using different enrichment metrics as well as internal validation metrics used for comparing clustering algorithms.
Enrichment analysis
Enrichment of known architectural proteins. We estimated the enrichment of three known architectural proteins (CTCF, RAD21, and SMC3) in the TAD boundaries of five cell lines from Rao et al. [37]. TAD boundaries are defined by the starting bin and the ending bin of each predicted TAD, along with one preceding the starting bin and one following the ending bin. Let N be the total number of bins in a chromosome, nBIND be the number of bins with one or more ChIP-seq peaks or accessible motif sites, nTAD be the number of TAD boundary bins, and nTAD-BIND be the number of TAD-boundary bins with a binding event (ChIP-seq peak or accessible motif match site). The fold enrichment for a particular protein is calculated as: \(\frac {n_{\text {TAD-BIND}}/n_{\text {TAD}}}{n_{\text {BIND}}/N}\). Within each cell line, the fold enrichment across all chromosomes was averaged; then, the mean across cell lines was used to rank the TAD-calling methods (Additional File 3, Table S1h).
Histone modification enrichment. We used the proportion of predicted TADs that are significantly enriched in histone modification signals (compared to the “null” histone-modification signal distribution of randomly shuffled TADs) as a validation metric to assess the quality of TADs, similar to Zufferey et al. [28]. For each predicted TAD, we calculated the mean histone modification ChIP-seq signal within the TAD. Next, we find the “null” histone-modification signal distribution from randomly shuffled TADs. To generate randomly shuffled TADs, we take the lengths of all predicted TADs within a chromosome, as well as the lengths of interspersed stretches between the TADs (i.e., “non-TAD” stretches) if a TAD-calling method skips over regions of the genome. Next, we randomly move around the TAD and non-TAD stretches within the chromosome to preserve the TAD length distribution. We repeat this procedure 10 times. Then, we compute the mean histone modification ChIP-seq signal within each of these randomly shuffled TADs, generating the null or background distribution of histone modification signals. The empirical p-value of a predicted TAD’s histone modification signal was calculated as the proportion of randomly shuffled TADs with higher ChIP-seq signal than that of the given TAD. A TAD was considered significantly enriched if its empirical p-value was less than 0.05, i.e., more than 95% of randomly shuffled TADs had a lower histone modification signal. Finally, for each TAD-calling method, we found the proportion of predicted TADs with significant histone modification signal; this is visualized across cell lines in Fig. 5B. The mean proportion of TADs with significant enrichment across chromosomes and cell lines was used to rank the TAD-calling methods (Additional File 3, Table S1i).
Internal validation metrics
Since a TAD represents a cluster of contiguous regions that tend to interact more among each other than with regions from another TAD or cluster, we used two internal validation or cluster quality metrics, Davies-Bouldin index (DBI) and Delta contact count (DCC), to evaluate the similarity of interaction profiles among regions within a TAD. Specifically, for each method, we generated a background/null distribution of DBI and DCC from randomly shuffled TADs, then measured the proportion of actual TADs called with significant DBI and DCC level (p-value <0.05) against this null distribution (similar procedure to “Histone modification enrichment” above).
Davies-Bouldin index (DBI). The DBI for a single cluster Ci is defined as its similarity to its closest cluster Cj, where \(i,j \in \{1, \dots, k\}, i \neq j\): DBIi= maxi≠jSij. The similarity metric, Sij, between Ci and Cj is defined as:
$$ S_{ij} = \frac{{d}_{i} + {d}_{j}}{\text{distance}_{ij}} $$
(5)
where di is the average distance between each data point in cluster Ci and the cluster centroid and distanceij is the distance between the cluster centroids of Ci and Cj. In applying DBI to Hi-C data, a data point consists of a vector of a genomic region’s interaction counts with other regions in the chromosome (e.g., an entire row or column in the Hi-C matrix); a cluster corresponds to a group of regions within the same TAD; the cluster centroid is a mean vector of rows that belong to the same cluster/TAD. The smaller the DBI, the more distinct the clusters are from one another.
For each method, we computed the DBI of each individual TAD. To measure the significance of a TAD’s DBI value, we generated a background/null distribution of DBI values from randomly shuffled TADs (refer to the procedure in “Histone modification enrichment” above). The empirical p-value of a TAD was calculated as the proportion of randomized TADs with lower DBI (recall a lower DBI means better clustering) than that of the given TAD. A TAD was considered to have a significant DBI if its empirical p-value was less than 0.05; the proportion of TADs with significant DBI was calculated for each method and used for comparing different TAD calling methods (Additional File 3, Table S1a).
Delta Contact Count (DCC). DCC for cluster Ci is defined as follows: let ini denote the mean interaction counts between pairs of regions that are both in Ci, and outi denote the mean interaction counts between pairs of regions where one region is in cluster Ci and the other region is not. Then DCCi=ini−outi.
We expect that for a good cluster, the pairs of regions within the cluster should have higher contact counts. Therefore, the higher the value of DCC, the higher the quality of the cluster. Again, a cluster corresponds to a group of regions within the same TAD. Given the DCC values for each TAD, we determined its significance against the null/background distribution of DCC values from randomly shuffled TADs (refer to procedure in “Histone modification enrichment” and “Davies-Bouldin index (DBI)” above). The mean proportion of TADs with significant DCC across cell lines was used to compare the TAD-calling methods (Additional File 3, Table S1b).
TAD similarity and stability metrics
When assessing the similarity or stability of TADs, we used two cluster comparison metrics, Rand index and mutual information. First, TADs were converted to clusters so that regions in the same TAD were all assigned to the same cluster; all non-TAD regions, if a TAD-calling algorithm should have them, were assigned to a single cluster together. When comparing TADs across different resolutions of Hi-C data, 10-kb, 25-kb, and 50-kb bins were split into a size of lowest common denominator, i.e., 5kb. Then all 5-kb bins were assigned to the same cluster as in the original lower-resolution bin (e.g., a 10-kb bin assigned to cluster i would yield two 5-kb bins assigned to cluster i). For these comparisons, we computed these metrics at the 5-kb resolution.
For Rand index, each genomic region is treated as a node in a graph; two nodes are connected by an edge if they are in the same cluster. Then, the number of edges that were preserved between clustering result A and clustering result B is divided by the total number of pairs of nodes, i.e., number of edges in a fully connected graph. Rand index of 1 corresponds to perfect concordance between two clustering results; Rand index of 0 means no agreement.
Mutual information (MI) is an information-theoretic metric measuring the dependency between two random variables, where each variable can be a clustering result. Specifically, for two discrete variables A and B, MI is defined as
$$ \textup{MI} (A;B) = \sum_{a \in A} \sum_{b \in B} p_{(A, B)} (a,b) \log \left(\frac{ p_{(A,B)} (a,b) }{ p_{A}(a) p_{B}(b) } \right) $$
(6)
For clustering comparisons, A and B are cluster assignments to be compared, e.g., A is the cluster assignment corresponding to TADs from high-depth data and B is the cluster assignment based on TADs from downsampled data. Mutual information is 0 if the joint distribution of A and B p(A,B)(a,b) equals the product of each marginal distribution, i.e., A and B are independent, or in an information-theoretic sense, knowing A does not provide any information about B. The higher the mutual information value, the greater the information conveyed by the variables about each other; in the context of measuring clustering agreement, one clustering result is similar to the other.
Both metrics were used to evaluate the stability of TADs across resolution and depth, the similarity of TADs from different TAD-calling methods, the recovery of TADs from smoothed Hi-C data, the similarity of TADs along the time-course data, and the consistency of GRiNCH TADs from different 3D genome capturing technologies (e.g., SPRITE, HiChIP). To rank TAD-calling methods based on stability across resolution and depth, the mean Rand index or mutual information across cell lines was used (Additional File 3, Table S1d-g).
Robustness to low-depth data
To assess the robustness or stability of TADs to low-depth input data, the TADs from a high-depth dataset (GM12878) [37] were compared to the TADs from a downsampled, low-depth dataset. If the TADs from the downsampled data are similar to TADs from the high-depth dataset, they are considered to be stable to low depth. The similarity metrics, mutual information and Rand index described in the “TAD similarity and stability metrics” section, were used.
In order to downsample a high-depth Hi-C matrix (e.g., from GM12878) to a lower depth one (e.g., from HMEC), a distance-stratified approach was used to match both the mean of non-zero counts and sparsity level between the two datasets. First, for each distance threshold d, let \(\mu _{d}^{h}\) denote the mean of the non-zero counts in the high-depth dataset and \(\mu _{d}^{l}\) denote the mean of non-zero counts in the low-depth dataset. The scaled down value for each non-zero entry of the original high-depth dataset is: \(\widetilde {x}_{ij} = \frac {x_{ij}^{h}}{ \mu _{d}^{h}/\mu _{d}^{l}}\). where \(x_{ij}^{h}\) is the value for the i, j bin pair in the high-depth dataset. Then, to increase the sparsity of the high-depth dataset, zd of the non-zero counts in the high-depth dataset at distance d is randomly set to zero, where zd is the number of additional entries in the low-depth dataset that are zero compared to the high-depth dataset.
Identification of candidate genomic regions involved in 3D organization changes during mouse neural development
To identify genomic regions potentially involved in local topological changes during the mouse neural development, we took GRiNCH clusters from the Hi-C data of mouse embryonic stem cells (mESC), neural progenitors (NPC), and cortical neurons (CN) [45] and looked for cluster merges or splits across the time points. We first performed pairwise cluster matching between time points (e.g., mESC vs CN). For each pair of clusters from time point A (e.g., cluster i from mESC) and time point B (e.g., cluster j from CN), we calculated their overlap in genomic regions with Jaccard index, i.e., the ratio of the size of the intersection (regions in both clusters) to the size of the union (regions in either cluster). We then considered clusters matched to two or more clusters in another time point with a Jaccard index of at least 0.2. For example, if cluster 5 from mESC matched to clusters 4, 5, and 6 from CN with Jaccard index of 0.3, 0.25, and 0.4, respectively, then we considered cluster 5 in mESC as a site of potential topological changes, identified by cluster splits in CN. We selected a random subset of these clusters from different chromosomes and visualized the interaction profile of the regions belonging to these clusters. The regions and clusters visualized in Fig. 7b and Additional File 1, Figure S8 are from this subset.
Identification of novel factor enrichment at GRiNCH TAD boundaries
A similar procedure to CTCF boundary enrichment was used to identify novel boundary elements, by assessing whether the accessible motif sites of 746 transcription factors (TFs) from the JASPAR core vertebrate collection [53] are enriched in GRiNCH TAD boundaries. This procedure was applied to the five cell lines from Rao et al. [37] and the time points from the mouse reprogramming timecourse data [46]. One change to the procedure was that instead of calculating fold enrichment per chromosome, all counts were aggregated across all chromosomes within the given cell line or time point. The hypergeometric test was used to calculate the significance of the number of TF sites in the boundaries and were ranked based on their p-value.
Smoothing methods
Smoothing with GRiNCH via matrix completion GRiNCH smooths a noisy input Hi-C matrix by using the matrix completion aspect of NMF. Specifically, the reconstructed matrix Xs=UVT is the smoothed matrix. The effectiveness of GRiNCH matrix completion as a smoothing method was compared to that of mean filter and Gaussian filter, two methods used in image blurring [86] and Hi-C data pre-processing [30, 42, 43], as well as HiCNN [32], a method based on convolutional neural network to impute interaction counts.
Mean filter
Mean filtering is used in HiCRep [30] as a preprocessing step to measure reproducibility of Hi-C datasets. To create a smoothed matrix Xs from a given input matrix X with a mean filter, each element in \(x_{ij}^{s}\) is estimated from the mean of its neighboring elements within radius r: \(x_{ij}^{s} = \frac {1}{(2r+1)^{2}} \sum _{a = i-r}^{i+r} \sum _{b = j-r}^{j+r} x_{ab}\). Three different values for the radius r were considered: r∈{3,6,11}.
Gaussian filter
A Gaussian filter has been used as a preprocessing step to identify chromatin loops and differential interactions from Hi-C Data [42, 43]. It uses a weighted mean of the neighborhood of a particular contact count entry, xij, where the weight is determined by the distance of the neighbor from the given position:
$$ x_{ij}^{s} = \frac{1}{2\pi \sigma^{2}} \sum_{a = i-n}^{i+n} \sum_{b = j-n}^{j+n} e^{-\frac{(i-a)^{2} + (j-b)^{2}}{2\sigma^{2}}}x_{ab} $$
(7)
Three different values of (σ) were considered, σ∈{1,2,3} and n was set to 4∗σ.
HiCNN
Unlike mean filter, Gaussian filter, and GRiNCH, HiCNN [32] uses supervised learning to perform smoothing. HiCNN uses a 54-layer convolutional neural network trained to predict high-resolution Hi-C interaction matrices from downsampled lower-resolution matrices. We downloaded three pre-trained models (from dna.cs.miami.edu/HiCNN along with source code) which were trained on GM12878 Hi-C data downsampled to 1/8, 1/16, and 1/25 depth of the original data, respectively. We used these pre-trained models in the smoothing analysis. These models were trained on interactions <2Mb apart and only make predictions for interaction distances <2Mb. To accommodate this limitation, AUPR on significant interaction recovery was measured separately for interactions <2Mb apart (see the “Assessment of benefits from smoothing” section below). Measuring TAD recovery after smoothing was not affected since the directionality index method uses a 2-Mb-sized window of interactions (see the “Directionality index” section above).
Assessment of benefits from smoothing
Recovery of TADs from smoothed downsampled data
To assess whether smoothing helps preserve or recover structure in low-depth data, we first smoothed downsampled low-depth datasets (see the “Robustness to low-depth data” section) using methods described above (see the “Smoothing methods” section). The Directionality index (DI) TAD-finding method was applied to the high and low-depth datasets. Then the similarity of the TADs from the original high depth data and the TADs from the smoothed data were measured (see the “TAD similarity and stability metrics” section”). Higher similarity metric values imply better recovery of structure from smoothing.
Recovery of significant interactions
Fit-Hi-C [44] was used to call significant interactions in the original and the smoothed Hi-C datasets, using a q-value <0.05. Interactions from the original high-depth Hi-C dataset were used as the set of “true” significant interactions. From the downsampled then smoothed matrices, each smoothed interaction count was assigned a “prediction score” of 1−q, where q is its Fit-Hi-C q-value. Precision and recall curves were then computed using the “true” interactions and the “prediction scores.” The recovery of true significant interactions was measured with the area under the precision-recall curve (AUPR).
Robustness to different restriction enzymes
In the smoothing analysis of data from Hi-C protocols using different restriction enzymes (HindIII, DpnII, MboI), the overlap of significant interactions was measured with Jaccard index, which is the ratio of the size of the intersection (i.e., significant interactions called in both datasets compared) to the size of the union (i.e., significant interactions called in either one of the datasets).