GRiNCH: simultaneous smoothing and detection of topological units of genome organization from sparse chromatin contact count matrices with matrix factorization

High-throughput chromosome conformation capture assays, such as Hi-C, have shown that the genome is organized into organizational units such as topologically associating domains (TADs), which can impact gene regulatory processes. The sparsity of Hi-C matrices poses a challenge for reliable detection of these units. We present GRiNCH, a constrained matrix-factorization-based approach for simultaneous smoothing and discovery of TADs from sparse contact count matrices. GRiNCH shows superior performance against seven TAD-calling methods and three smoothing methods. GRiNCH is applicable to multiple platforms including SPRITE and HiChIP and can predict novel boundary factors with potential roles in genome organization. Supplementary Information The online version contains supplementary material available at (10.1186/s13059-021-02378-z).

S6 Similarity of topology and significant interactions among Hi-C data using different restric-  Figure S1: Selecting graph regularization parameters λ and neighborhood radius, r. λ controls the strength of regularization. r determines how many neighboring genomic regions will be used to influence a given region during the regularization process. A. Shown are the count of chromosomes in which the given λ value (row) gave the best CTCF enrichment within the given cell line (column), for different k settings denoted by subTAD, TAD, and metaTAD scale (see Figure S14, Figure S15). Within each cell line, for each chromosome, we ranked the tested parameter combinations of λ and r, and then counted the times a given λ yielded the best CTCF enrichment (regardless of r value). Due to ties in ranking, each column can add up to 23 or more. λ = 0 corresponds to vanilla NMF without regularization. B. Shown are the count of chromomes in which the given neighborhood radius value (row) gave the best CTCF enrichment in each cell line (column) at different k settings. As in A, we rank the parameter combinations of λ and r based on the CTCF enrichment, then count the number of times a particular value of r yielded the best fold enrichment. N/A corresponds to vanilla NMF without regularization. Figure S2: The size distribution of TADs identified by different methods from different resolutions of Hi-C data. Y-axis is in log10 scale of base pairs. The white dot in inside each violin represents the median; the black box inside each violin stretches from Q1 (25th percentile) to Q3 (75th percentile). Insulation method is missing TAD distributions from 10kb data because it did not return any TADs when using the same hyperparameters as in from 25kb and 50kb data. Figure S3: Similarity of TADs across resolutions for different methods, measured by mutual information. TADs were first converted to clusters, by assigning regions within the same TAD into a single cluster. To compare across resolutions, 10kb, 25kb, and 50kb bins were split into a size of lowest common denominator, i.e., 5kb. Then all 5kb bins were assigned to the same cluster as in the original lower-resolution bin (e.g. a 10kb bin assigned to cluster A would yield two 5kb bins assigned to cluster A). Finally, cluster assignments in these split, higher-resolution bins were compared for their similarity with Rand Index (Figure 3 in main text) and Mutual Information.  Figure S5: Proportion of TADs, from different Hi-C resolutions, with significant mean histone modification signal (i.e. empirical p-value < 0.05). The darker the entry the higher the proportion of TADs with significant histone enrichment. The average ChIP-seq signal for each histone modification mark was taken from within each TAD; the p-value of each TAD is derived from an empirical null distribution of mean signals in randomly shuffled TADs. Note: 3DNetMod outputted overlapping TADs and was excluded from this analysis as it involves TAD randomization/shuffling. Figure S6: Similarity of structure and significant interactions among Hi-C data using different restriction enzymes and Hi-C data smoothed by GRiNCH. A. Similarity of Directionality TADs, measured by Rand index, from Gm12878 Hi-C data using different restriction enzymes (HindIII, DpnII, MboI) and smoothed by GRiNCH (HindIII smoothed, DpnII smoothed, MboI smoothed). B. Similarity of Directionality TADs, measured by mutual information. C. Similarity or overlap in significant interactions, measured by Jaccard Index, called by FitHiC from Gm12878 Hi-C data using different restriction enzymes (HindIII, DpnII, MboI) and smoothed by GRiNCH (HindIII smoothed, DpnII smoothed, MboI smoothed).    Figure 7A in main text). Interactions counts were log2-transformed for better visualization. The boxes represent TAD boundaries.  Figure 7B in main text). Interactions counts were log2-transformed for better visualization. The boxes represent TAD boundaries.   Figure S14: Characterizing GRiNCH clusters of different size scales, i.e., subTADs, TADs, metaTADs. Shown are different statistics for different settings of the number of clusters in GRiNCH, k. We set k based on the expected size of the clusters and consider three expected sizes: subTADs (500kb), TADs (1MB), metaTADs (2MB). TADs are known to be ∼1MB and therefore, for the "TAD scale" regions, k = nc 1Mb , where n c is the length of chromosome c. A. The number of subTADs, TADs, and metaTADs, and their median size for Hi-C datasets from five cells lines at three different resolutions: 10kb, 25kb, 50kb. B. Proportion of subTADs, TADs, or metaTADs with significantly better (p-val < 0.05) cluster quality metrics than random clusters, as measured by Davies-Bouldin Index and Delta Count. Results here are shown for 25kb-resolution data. C. The similarity between subTADs, TADs, and metaTADs from high-depth GM12878 dataset and those from datasets downsampled to other lower depths available in different cell lines. Similarity is measured by Rand Index and Mutual Information. Results here are shown for 25kb-resolution data. Figure S15: Enrichment of regulatory signals in GRiNCH clusters of different expected sizes: subTAD (500kb), TAD (1Mb) , and metaTAD (2Mb). The expected size is used to set the GRiNCH parameter k, the number of clusters. Results are shown for 25kb resolution data. A. Fold enrichment of architectural protein binding signals in subTAD, TAD, and metaTAD boundaries. B. Proportion of subTADs, TADs, and metaTADs with significantly higher values (p-val < 0.05) of mean histone modifications compared to random clusters. Figure S16: GRiNCH TAD size distribution by graph regularization parameters A. λ and B. neighborhood radius. λ = 0 and neighborhood radius of "N/A" correspond to no regularization. Shown are distributions for different settings of the number of clusters in GRiNCH, k. We set k based on the expected size of the clusters and consider three expected sizes: subTADs (500kb), TADs (1MB), metaTADs (2MB). TADs are known to be ∼1MB and therefore, for the "TAD scale" regions, k = nc 1Mb , where n c is the length of chromosome c. Figure S17: Memory consumption and runtime trend of the GRiNCH algorithm. A. Memory consumption plotted against the input Hi-C matrix size (determined by the size of the chromosome and Hi-C resolution). Each point represents the maximum resident set size for a run of GRiNCH for a given input matrix size. The GRiNCH scales (subTAD, TAD, and metaTAD) represent the chromosome-specific k parameter value, which is set based on the expected size of an output TAD/cluster (500kb, 1Mb, 2Mb respectively); for the same input matrix size, metaTAD setting uses k value that is half of the TAD setting and TAD uses k half of subTAD setting. For a given matrix size and k combination, GRiNCH was run with every combination of regularization parameters λ ∈ {1, 10, 100, 100} and neighborhood radius ∈ { 25kb, 50kb, 100kb, 250kb, 500kb, 1Mb } as well as without any regularization (λ = 0). These runs were completed across a distributed computing platform with machines of varying computing power. B. Runtime distribution of GRiNCH against the input Hi-C matrix size. GRiNCH runtime was measured for different matrix sizes, TAD scales, and regularization parameters (see A), from start to successful termination of program. The dark line in the middle represents the median runtime of GRiNCH. The top of the shaded area represents the 75th percentile, the bottom 25th percentile of runtimes. Figure S18: Comparison of NNDSVD initialization versus random initialization. A. Loss as a function of iterations of the GRiNCH algorithm with NNDSVD initialization versus random initialization. For each type of initialization, we tracked the value of the objective/loss over 50 iterations of the GRiNCH algorithm, using 100 different seeds for randomization. The solid line represent the mean loss at a given iteration, and the lightly colored bands around each line is the standard deviation of the loss at the given iteration. B. The distribution of the loss after 50 iterations for each type of initialization. Each point on the box plot corresponds to one of 100 different random seeds. C. Measuring stability of GRiNCH TADs from NNDSVD initialization versus random initialization. Each box plot shows the distribution of similarity score of GRiNCH TADs using different initialization methods. GRiNCH TADs after 50 iterations from each run with different seeds and initialization methods were converted to clusters. Pairwise similarity of clustering results from seed x and seed y using the same initialization method was measured with mutual information. Higher mutual information means the GRiNCH TADs tend to be more similar and therefore more stable across different seeds. D. Measuring stability of GRiNCH TADs with Rand Index. GRiNCH TADs were generated in the same as in C. Higher Rand index means the GRiNCH TADs tend to be more similar and therefore more stable across different seeds. Table S1(C) Ranking TAD-calling methods by mean (absolute) change in median TAD size as input data resolution changes from 10kb to 25kb to 50kb

List of Supplementary Tables
Table S1(D) Ranking TAD-calling methods by mean Rand Index between TADs from two different resolutions of Hi-C data.
Table S1(E) Ranking TAD-calling methods by mean Mutual Information between TADs from two different resolutions of Hi-C data.
Table S1(F) Ranking TAD-calling methods by mean Rand Index between TADs from high-depth Gm12878 data and TADs from Gm12878 data downsampled to other cell lines' depth.
Table S1(G) Ranking TAD-calling methods by mean mutual information between TADs from high-depth Gm12878 data and TADs from Gm12878 data downsampled to other cell lines' depth.
Table S1(I) Ranking TAD-calling methods by mean proportion of TADs with significant histone modification signals across 5 cell lines