Hierarchical Domain Structure Reveals the Divergence of Activity among TADs and Boundaries

Mammalian genomes are organized into different levels. As a fundamental structural unit, Topologically Associating Domains (TADs) play a key role in gene regulation. Recent studies showed that hierarchical structures are present in TADs. Precise identification of hierarchical TAD structures however remains a challenging task. We present OnTAD, an Optimal Nested TAD caller from Hi-C data, to identify hierarchical TADs. Systematical comparison with existing methods shows that OnTAD has significantly improved accuracy, reproducibility and running speed. OnTAD reveals new biological insights on the role of different TAD levels and boundary usage in gene regulation and the loop extrusion model. The OnTAD is available at: https://github.com/anlin00007/OnTAD


8
TAD structures however remains a challenging task. We present OnTAD, an Optimal Nested 1 9 TAD caller from Hi-C data, to identify hierarchical TADs. Systematical comparison with existing 2 0 methods shows that OnTAD has significantly improved accuracy, reproducibility and running Previous studies have shown that human genome is highly compacted and organized at 2 7 different levels in nucleus, and the spatial organization of chromatin is essential for gene TADs, respectively. We observed that the active epigenetic states are increasingly enriched 1 9 0 along the levels of TADs (p-value of correlation < 2.2e-16) (Figure 3c,d). In comparison, 1 9 1 singletons are noticeably less active compared with hierarchical TADs (especially level >2).

9 2
Similar pattern was also observed in other cell types (K562 and Huvec) (Supplementary Figure   1 9 3 4). Taken together, our results showed that hierarchical TADs are on average more active than 1 9 4 singletons, and within hierarchical TADs, inner TADs (e.g., sub TADs) are more active than 1 9 5 outer TADs. 1 9 6 1 9 7 Active genes in hierarchical TADs 1 9 8 We further investigated how gene expression is associated with TAD hierarchies. Using the 1 9 9 RNA-seq data of GM12878 from ENCODE consortium (www.encodeproject.org), we calculated 2 0 0 the density of the number of expressed genes (FPKM > 5) within TADs at each level, which is 2 0 1 defined as the number of expressed genes per bin (i.e., 10Kb region). For genes covered by 2 0 2 multiple TADs, we associated them with the innermost TADs. We found that as the TAD level 2 0 3 increases, the density of the number of expressed genes also increases, i.e., genes are more It has been reported that TAD boundaries are interaction hotspots and are highly active [19]. We 2 1 0 also observed that, for TADs at all levels, the number of expressed genes and the enrichment of

1 6
For hierarchical TADs, we observed that their boundaries are frequently shared by 2 1 7 multiple TADs in a common hierarchical branch. Interestingly, for boundaries shared by more 2 1 8 than three TADs on either side, we observed that the numbers of TADs on the two sides of the 2 1 9 boundaries are significantly different, showing a preference of asymmetric hierarchical 2 2 0 structures over symmetric structures (Supplementary Table 2, Chi-Square p-value < 1e-10). We  We checked the enrichment of active epigenetic states at different boundary levels. We noticed 2 3 0 a significant positive linear correlation between the enrichment of active epigenetic states and 2 3 1 the number of times each boundary is shared (e.g., p-value = 7.43e-05 for Tss, 0.00204 for 2 3 2 TssCtcf and 0.001215 for Enh) (Figure 4b). We further studied the relationship between gene 2 3 3 expression level and the boundary sharing. Again, we observed a significant positive linear 2 3 4 correlation (p-value = 0.007) between the number of times a boundary is shared and the gene 2 3 5 expression level. In particular, the gene expression level at the boundaries that are shared by 5 shared by 5 or more TADs as super boundaries. We posit that super boundaries are more 2 3 8 active than boundaries shared by one or two TADs. We next asked if the observed asymmetry in boundary sharing is associated with the 2 4 2 mechanism of loop formation. A recent study in yeast suggests that loops are formed in an 2 4 3 asymmetric process with the cohesin anchors on one side and reels from another side [21]. One 2 4 4 possibility of asymmetric boundary usage is that loop extrusion may be stopped by some 2 4 5 mechanism related to CTCF or other architectural protein's binding orientation at the boundaries.

4 6
In such case, stopping of loop extrusion will depend on genomic location, and when loop 2 4 7 extrusion stops, it will result in boundary usage asymmetry on the two sides. This may also 2 4 8 explain asymmetric loop extrusion observed in experiments. It is known that the anchor sites in 2 4 9 yeast are supported by Ycg1 HEAT-repeat and Brn1 kleisin subunits [22]. In human, however, 2 5 0 little is known about the proteins supporting the anchor sites. We therefore performed 2 5 1 transcription factor (TF) enrichment analysis of 161 TF ChIP-seq data from ENCODE 2 5 2 consortium at different levels of TAD boundaries. We found that many structural-related TFs,  unclear how the TAD hierarchies are involved in gene regulation mechanisms. This is partly due 2 6 0 to the lack of a TAD calling method that can systematically identify all TAD hierarchies from Hi-2 6 1 C data. Here we introduce OnTAD, a new method to uncover the hierarchical TAD structures 2 6 2 from Hi-C data with substantially improved sensitivity and specificity than existing methods.

6 3
OnTAD yields optimal solutions with respect to its scoring function, subject to the accuracy of 2 6 4 the pre-selected sets of candidate TAD boundaries identified by its local minimum-searching 2 6 5 algorithm. Our comprehensive evaluation shows that OnTAD substantially outperforms the 2 6 6 existing tested TAD calling methods in accuracy, reproducibility and running speed. Importantly, denote a sub-matrix of X. A candidate TAD between bins a and b corresponds to a diagonal where the mean of the entries in X [a,b] is expected to be higher 3 1 5 than that in its neighboring matrices. Because of the distance dependency in Hi-C data, i.e., the 3 1 6 dependence of contact frequency on the proximity of the interaction loci, we normalize the Hi-C 3 1 7 matrix before TAD calling by subtracting the mean counts at each distance from the original 3 1 8 contact matrix. We develop a TAD calling algorithm to assemble TADs from the candidate boundaries. Several 3 2 2 issues need to be considered in the design of the algorithm in order to produce biologically 3 2 3 meaningful TADs. First, because a region may be shared by multiple candidate TADs, the 3 2 4 scores of these TADs can be strongly correlated. Second, in the TADs with nested structures, 3 2 5 the scores of the TADs and their nested sub-TADs are convoluted. Third, some boundaries may 3 2 6 be shared between TADs due to the biological mechanisms of loop formation. Last, the 3 2 7 algorithm needs to be computationally efficient to call TADs in the genome scale.

2 8
To address these issues, we developed a recursive algorithm to identify the set of TADs 3 2 9 that gives the optimal partition of the genome according to a scoring function g(X) (see the next 3 3 0 section). Our algorithm assumes that any given two TADs are either completely non-overlapping 3 3 1 (but can share boundaries) or completely overlapping (i.e. one TAD is nested within the other).

2
While this assumption sometimes may not be true, it greatly reduces the complexity of the 3 3 3 problem while still enabling us to 1) de-convolute nested TAD structures, 2) impose shared 3 3 4 boundaries, and 3) obtain an efficient algorithmic solution. When it is violated, i.e., the 3 3 5 boundaries of the TADs cross with each other, our method can still produce a reasonable 3 3 6 approximation ( Supplementary Fig.1C).

7
Briefly, the algorithm works as follows. Given a matrix X [a,b] , the algorithm starts at the 3 3 8 root level to first find the best bin i (a≤i<b) to partition the matrix into two submatrices, X [a,i] and 3 3 9 X [i,b] , such that X [i,b] is the largest right-most TAD in X [a,b] . Since X [a,i] and X [i,b] are disjointed, the 3 4 0 TADs within each submatrix can be called separately in a recursive manner. At each recursive 3 4 1 step, the parent matrix is partitioned into two sub-matrices, and TADs are called within each 3 4 2 sub-matrix using the same recursive formula (Supplementary Fig.2A). The recursion stops when 3 4 3 i=a, i.e., the sub-matrix X [a,i] Here, is the score of TADs within X [a,b] , not including the score for X [a,b] itself being a 3 5 5 TAD. It is calculated by finding the best left boundary of the largest right-most TAD in X [a,b] .
is the score of the largest right-most TAD in X [a,b] . It is the sum of the score of TADs diagonal block matrix to be called a TAD, its mean signal is required to be greater than the 3 5 9 means of its neighboring regions on both sides. We therefore define where m(X [i,b] |sub TADs) denotes the mean of X [i,b] , excluding the TADs within X [i,b] , as returned contact frequency. For each genomic distance, the TAD-adjR 2 is defined as         ). b, Two real TADs (a,c) and (b,c) are nested, which makes the score of (a,c) convoluted with the score of (b,c). c, Real TADs (a,c) and (b,d) are partially overlapping, which may be recaptured as nested TADs (b,c), (a,c) and (a,d).

a b c
Supplementary figure 2 | Illustration of the recursive TAD calling algorithm. a, Starting from the entire matrix, we partition the matrix into two matrices: the one that potentially forms the largest right-most TAD (triangles marked in black), and the remaining part. We call the same function on each sub-matrix to recursively identify nested TAD structures, and the best partition is determined by a scoring function. b, Each recursion step identifies the best set of TADs in its matrix under consideration, and return the TAD calls back to its parent until the root.

a b
Supplementary figure 3 | TAD reproducibility under different measurements. a, Adjusted rand index between TADs from two biological replicates (GM12878, 10Kb). b, Adjusted rand index across TADs from Hi-C data in multiple resolutions (GM12878, 5Kb, 10Kb and 25Kb). We excluded TADtree as it took too much computing resources on high resolution data. c, Adjusted rand index between TADs from Hi-C data in original sequencing depth and in different down sampled sequencing depth (GM12878, 1/4, 1/8, 1/16 and 1/32 of original sequencing depth ). All comparisons are calculated on autochromosomes individually (excluding chr1 and chr9 due to no results produced by some methods).