Simultaneous smoothing and detection of topological units of genome organization from sparse chromatin contact count matrices with matrix factorization

The three-dimensional (3D) organization of the genome plays a critical role in gene regulation for diverse normal and disease processes. High-throughput chromosome conformation capture (3C) assays, such as Hi-C, SPRITE, GAM, and HiChIP, have revealed higher-order organizational units such as topologically associating domains (TADs), which can shape the regulatory landscape governing downstream phenotypes. Analysis of high-throughput 3C data depends on the sequencing depth, which directly affects the resolution and the sparsity of the generated 3D contact count map. Identification of TADs remains a significant challenge due to the sensitivity of existing methods to resolution and sparsity. Here we present GRiNCH, a novel matrix-factorization-based approach for simultaneous TAD discovery and smoothing of contact count matrices from high-throughput 3C data. GRiNCH TADs are enriched in known architectural proteins and chromatin modification signals and are stable to the resolution, and sparsity of the input data. GRiNCH smoothing improves the recovery of structure and significant interactions from low-depth datasets. Furthermore, enrichment analysis of 746 transcription factor motifs in GRiNCH TADs from developmental time-course and cell-line Hi-C datasets predicted transcription factors with potentially novel genome organization roles. GRiNCH is a broadly applicable tool for the analysis of high throughput 3C datasets from a variety of platforms including SPRITE and HiChIP to understand 3D genome organization in diverse biological contexts.


Introduction
Results enrichment as the other top performing methods, namely, Directionality Index and Insulation Score in most cell lines, and HiCseg in K562 and Huvec. All these methods including GRiNCH have significantly 141 higher enrichment than 3DNetMod, rGMAP, Armatus across different cell lines. 142 As histone modifications have been shown to be associated with three-dimensional organization [39], TADs that are associated with significant histone modifications. However, GRiNCH ranks among the 154 top methods for both criteria suggesting that GRiNCH TADs capture a diverse types of TADs. 155 GRiNCH smoothing of low-depth datasets help recover structure and significant 156 interactions. 157 Our analysis so far compared different TAD finding methods for their ability to recover stable and bio- Hi-C matrix. We next compared GRiNCH's smoothing functionality to common smoothing techniques 164 such as mean filter and Gaussian filter, which are used in imaging domains and also for Hi-C data [30]. 165 We used two metrics to assess the quality of smoothing: (a) recovery of TADs and (b) recovery of sig-166 nificant interaction after smoothing downsampled data. To perform these comparisons, we again used 167 the downsampled GM12878 datasets. 168 To assess TAD recovery, we identified TADs on the original high-depth GM12878 dataset and com-169 pared them to the TADs identified in the downsampled and smoothed data matrices using Rand Index 170 and Mutual Information. Here, to avoid any bias in our interpretation, we used the Directionality Index method to call TADs. We find that using both Rand Index and Mutual Information, TADs recovered on ent parameter settings of the mean filter and Gaussian filters ( Figure 5A). The usefulness of GRiNCH 174 is more apparent for lower-depth datasets such as NHEK. To assess the recovery of significant interac-  GRiNCH application to chromosomal organization during development. 185 To assess the value of GRiNCH in primary cells and to examine dynamics in chromosomal organiza-186 tion, we applied GRiNCH to two time-course Hi-C datasets profiling 3D genome organization during (a) 187 mouse neural development [41] and (b) pluripotency reprogramming in mouse [42]. Bonev et al. [41] 188 used high-resolution Hi-C experiments to measure 3D genome organization during neuronal differenti-189 ation from the embryonic stem cell state (mESC) to neural progenitor cells (NPCs) and cortical neurons 190 (CNs). We applied GRiNCH on all chromosomes for all three cell types and compared them based on 191 the overall similarity of TADs between the cell lines. Based on the two metrics of Mutual Information 192 and Rand Index, the overall TAD similarity captured the temporal ordering of the cells, with CNs being 193 closer to NPCs and ESCs the most distinct ( Figure S3A). We next focused on a specific 4Mbp region 194 around the Zfp608 gene, which was found by Bonev et al. as a neural-specific gene associated with 195 a changing TAD boundary. In both NPCs and CNs, GRiNCH predicts a TAD near the Zfp608 gene, 196 which is not present in the mESC state. Zfp608 was also associated with increased expression, and ac-197 tivating marks, H3K27ac and H3K4me3 at these time points, which is consistent with Zfp608 being a 198 neural-specific gene ( Figure 6A). 199 We next examined another time-course dataset which studied the 3D genome organization during to all chromosomes from each time point and compared the overall 3D genome configuration over time. 203 Here too we observed that time points closer to each other generally had greater similarity in their TAD 204 structure, as well as two different replicates within the same time point displaying even greater TAD 205 similarity ( Figure S4B). We examined the interaction profile in the 1.3 Mbp around the Sox2 gene, a 206 known pluripotency gene ( Figure 6B). We see a gradual formation of a boundary around Sox2, which is 207 also associated with concordant increase in expression, accessibility and the presence of H3K4me2, an 208 active promoter mark. 209 As chromatin accessibility data was also measured at each timepoint during reprogramming, we 210 asked if we could identify additional regulatory proteins that could play a role in establishing TADs 211 (Methods). Briefly, we tested the GRiNCH TAD boundaries from each mouse cell type, from pre-  Table S2). The top-ranking TF across the cell types was 215 CTCF, which is consistent with its role as an architectural protein in establishing TADs ( Figure 6C). 216 We also found other factors in the same zinc finger protein family as CTCF [44], such as ZBTB14, [37], and H3k27ac HiChIP [50]. A visual comparison between these datasets for an 8Mb region of  Figure 7E) and Mutual Information ( Figure 7F). Interestingly, the GRiNCH TADs from Hi-C are the 248 most similar to those from cohesin HiChIP and this similarity measure is higher than between the two 249 HiChIP datasets. This is consistent with cohesin being a major determinant for the formation of loops 250 detected in HiC datasets. The H3K27ac HiChIP data is as close to Hi-C as it is to cohesin HiChIP. Finally

Discussion
We present GRiNCH, a graph-regularized matrix factorization framework that enables reliable identifi-258 cation of high-quality genome organizational units, such as TADs, from high-throughput chromosome 259 conformation capture datasets. GRiNCH is based on a novel constrained matrix factorization and cluster-260 ing approach that enables recovery of contiguous blocks of genomic regions sharing similar interaction 261 patterns as well as smoothing sparse input datasets.

262
A lack of gold standards for TADs emphasizes the need to probe both the statistical and biological 263 nature of inferred TADs. Through extensive comparison of GRiNCH to existing methods with good per-264 formance in other benchmarking studies, we identified strengths and weaknesses of existing approaches. 265 In particular, methods like Insulation Score identify TADs that are generally more enriched for signals 266 such as CTCF and cohesin; however, when comparing statistical properties such as stability across reso- Hi-C data (e.g. measuring reproducibility or concordance between Hi-C replicates [31]). We find that 274 GRiNCH smoothing outperforms existing smoothing methods (mean filter and Gaussian filter) in its 275 ability to retain TAD-level and interaction-level features of the input Hi-C data. Furthermore, GRiNCH 276 is applicable to datasets from a wide variety of platforms, including SPRITE and HiChIP. Application 277 of GRiNCH shows that Hi-C and HiChIP datasets capture more similar topological units than SPRITE. 278 Interestingly, TADs from Hi-C and cohesin HiChIP are much closer than the two HiChIP datasets we 279 compared. This shows that GRiNCH is capturing TADs that are reproducible across platforms. To study 280 the ability of GRiNCH to identify dynamic topological changes along a time course, we applied GRiNCH  target genes in the brain [49]. Therefore the simultaneous enrichment of MEIS3 and HOX sites is con-302 sistent with HOX proteins requiring MEIS3 for binding; however, its specific role in the hematopoietic 303 lineage is yet unknown. Investigating the interactions of these proteins with well-known architectural  In conclusion, GRiNCH offers a unified solution, applicable to diverse platforms, to discover reliable 315 and biologically meaningful topological units, while handling sparse high-throughput chromosome con-316 formation capture datasets. The outputs from GRiNCH can be used to predict novel boundary elements, 317 enabling us to test possible hypotheses of other mechanisms for TAD boundary formation. We have 318 made GRiNCH available at roy-lab.github.io/grinch, with a comprehensive installation and usage man-319 ual. As efforts to map the three-dimensional genome organization expand to more conditions, platforms, 320 and species, a method such as GRiNCH will serve as a powerful analytical tool for understanding the 321 role of genome 3D organization in diverse complex processes.

Materials and Methods
Hi-C data (GRiNCH) framework 325 GRiNCH is based on a regularized version of non-negative matrix factorization (NMF) [35] that is appli-326 cable to high-dimensional chromosome conformation capture data such as Hi-C ( Figure 1). Below we 327 describe the components of GRiNCH: NMF, graph regularization, and clustering for TAD identification.

Non-negative matrix factorization (NMF) and graph regularization 329
Non-negative matrix factorization is a popular dimensionality reduction method that aims to decompose 330 a non-negative matrix, X ∈ R (n×m) into two lower dimensional non-negative matrices, U ∈ R (n×k) and 331 V ∈ R (n×k) , such that the product X * = UV T , well approximates the original X. We refer to the U and 332 V matrices as factors. Here k << n, m is the rank of the factors and is user-specified. 333 In application of NMF to Hi-C data, we represent the Hi-C data for each chromosome as a symmetric 334 matrix X = [x ij ] ∈ R (n×n) where x ij represents the contact count between region i and region j. We 335 note that in the case of a symmetric matrix, U and V are the same or related by a scaling constant. 336 The goal of NMF is to minimize the following objective:

337
A number of algorithms to optimize this objective have been proposed; here we used the multiplicative 338 update algorithm, where the entries of U and V are updated in an alternating manner each iteration: Here u ik corresponds to the i th row of column U(:, k) and v jk corresponds to the j th row of column 340 V(:, k).

341
Standard application of NMF to Hi-C data is ignorant of the strong distance dependence of the count 342 matrix, that is, genomic regions that are close to each other tend to interact more with each other. To we define a symmetric nearest-neighbor graph, W: where N r (x i ) denotes r nearest neighbors in linear distance to region x i .

350
Graph regularized NMF has the following objective: where D is a diagonal matrix whose entries are column (or row, since W is symmetric) sums of Both r (neighborhood radius) and λ are parameters that can be specified, with λ setting the strength instead of being a contiguous local structural unit. To address this problem, we apply chain-constrained 371 k-medoids clustering. k-medoids clustering is similar to k-means clustering, except that the "center" 372 of each cluster is always an actual data point, rather than the mean of the datapoints in the cluster.

373
In its chain-constrained version (Algorithm 1), adopted from spatially connected k-medoids clustering 374 [60]: each cluster grows outwards from initial medoids along the linear chromosomal coordinates. The 375 algorithm assigns a genomic region to a valid medoid region either upstream or downstream along the 376 chromosome, ensuring the contiguity of the clusters and resilience to noise or extreme outliers provided 377 by using a robust 'median'-like cluster center rather than a 'mean'-like center used in k-means clustering.

378
Algorithm 1: Chain-constrained k-medoids clustering Input: U ∈ R n×k , one of the factors from NMF, and maxIter, the maximum number of iterations Output: The cluster assignments, C ∈ {c 1 , c 2 , . . . , c n }, for each of the chromosomal bins 1 Initialize k medoids to be the rows with the largest value from each column of U 2 Initialize an empty priority queue Q 3 while numIter < maxIter do 4 Add current medoids to priority queue Q, with priority value of 0 // Q orders bins by ascending priority values. i.e. factors that can better approximate the original Hi-C matrix, in fewer iterations ( Figure S7A,B), and 409 more stable results than direct random initialization ( Figure S7C,D).

Datasets used in experiments and analysis 411
High-throughput chromosome conformation capture data 412 We applied GRiNCH to SQRTVC-normalized Hi-C matrices from five cell lines, GM12878, NHEK, In the insulation score method, each bin is assigned an insulation score, calculated as the mean of the  Since a TAD represents a cluster of contiguous regions that tend to interact more among each other than 526 with regions from another TAD or cluster, we used two internal validation or cluster quality metrics, 527 Davies-Bouldin Index and Delta count, to evaluate the similarity of interaction profiles among regions 528 within a TAD. 529 Davies-Bouldin Index (DBI) The DBI for a single cluster C i is defined as its similarity to its closest The similarity metric, S ij , between C i 531 and C j is defined as: where d i is the average distance between each data point in cluster C i and the cluster centroid and 533 distance ij is the distance between the cluster centroids of C i and C j . In applying DBI to Hi-C data, a data

538
For each TAD-calling method, we first computed the DBI for each TAD. Next, TADs and non-TAD 539 stretches were shuffled within each chromosome 10 times to yield randomized TADs. The empirical 540 p-value of a TAD was calculated as the proportion of randomized TADs with lower DBI (recall a lower 541 DBI means better clustering) than that of the given TAD.

542
Delta Contact Count (DCC) DCC for cluster C i is defined as follows: let in i denote the mean 543 interaction counts between pairs of regions that are both in C i , and out i denote the mean interaction 544 counts between pairs of regions where one region is in cluster C i and the other region is not. Then We expect that for a good cluster, the pairs of regions within the cluster should have higher contact 547 counts. Therefore, the higher the value of DCC, the higher the quality of the cluster. Again, a cluster 548 corresponds to a group of regions within the same TAD. Given the DCC values for each TAD, the 549 empirical p-value of a TAD was calculated as the proportion of randomized TADs with higher delta 550 count than that of the given TAD.

551
A TAD was considered to have significant DBI or DCC if its p-value was less than 0.05. The 552 mean proportion of TADs with significant DBI/DCC across cell lines was used to rank the TAD-calling 553 methods (Table S1A, where A, B are random variables derived from clustering results, e.g. A is the cluster assignment cor-567 responding to TADs from high-depth data and B is the cluster assignment based on TADs from down- technologies (e.g. SPRITE, HiChIP). In ranking TAD-calling methods for stability across depth, the 577 mean Rand Index or Mutual Information across cell lines was used (Table S1D,E, Supplementary Data).

Robustness to low-depth data 579
To assess the robustness or stability of TADs to low-depth input data, the TADs from a high-depth dataset 580 (GM12878) [36] were compared to the TADs from a downsampled, low-depth dataset. If the original set 581 of TADs are similar to the set of TADs from downsampled data, they are considered to be stable to low 582 depth. The similarity metrics used are described in the "TAD similarity and stability metrics" section.

583
In order to downsample a high-depth Hi-C matrix (e.g. from GM12878) to similar levels as a lower 584 depth one (e.g. from HMEC), a distance-stratified approach was used to match both the mean of non- x ab Three different values of (σ) were considered, σ ∈ {1, 2, 3} and n was set to 4 * σ.

614
Assessment of benefits from smoothing 615 Recovery of TADs from smoothed downsampled data To assess whether smoothing helps pre-616 serve or recover structure from low-depth data, downsampled datasets (see "Robustness to low-depth    GRiNCH applies Non-negative Matrix Factorization (NMF) to a Hi-C or a similar high-throughput 3C matrix to find clusters of densely interacting genomic regions. NMF recovers low-dimensional factors U and V of the input matrix X that can be used to reconstruct the input matrix. As nearby genomic regions tend to interact more with each other, we regularize the factor matrices with a neighborhood graph to encourage neighboring regions to have a similar lower-dimensional representation, and subsequently belong to the same cluster. We cluster the regions by treating one of the factor matrices as a set of latent features and applying k-medoids clustering. The clusters represent topological units such as TADs. The factor matrices can be multiplied together to yield a smoothed version of the input matrix which is often sparse and noisy.      Shown are the mean fold enrichment of CTCF ChIP-seq peaks and accessible motif instances of cohesin proteins, RAD21 and SMC3, estimated across multiple chromosomes. The error bar denotes the standard deviation from the mean. B. Proportion of TADs with significant mean histone modification signal (i.e. empirical p-value < 0.05). The darker the entry the higher the proportion of TADs with significant histone enrichment. The average ChIP-seq signal for each histone modification mark was taken from within each TAD; the p-value of each TAD is derived from an empirical null distribution of mean signals in randomly shuffled TADs. Note: 3DNetMod outputted overlapping TADs, even when run under non-hierarchical settings; and it was excluded from this analysis as it involves TAD randomization/shuffling.