TAD identification across data resolutions and normalizations
Data generated by high-throughput sequencing of chromatin conformation capture experiments (Hi-C) is typically organized in contact matrices, where interactions are grouped into bins of a predefined size and the numbers of contacts between pairs of bins constitute the entries of these matrices [1]. While the choice of the bin size is somewhat arbitrary and dependent of the sequencing depth, proposed heuristics select the minimum size such that the total number of reads per bin is deemed sufficient for a robust estimation of the contacts between each pair of loci [2]. Small bin sizes are possible only with high sequencing depth and allow for high-resolution analyses of chromatin loops. Conversely, large bin sizes might be used to study large chromatin structural elements (e.g., compartments [2, 5]) or are the necessary choice when a low number of reads are available leading to low-resolution matrices. As a consequence, the bin size of a Hi-C matrix reflects its resolution. Furthermore, values within such contact matrices are typically normalized to account for technical and experimental biases such as variable sequencing depth, GC content, fragment length, and sequence mappability. For the purpose of our analyses, we generated contact matrices for chromosome 6 of the GM12878 cell line with 4 different resolutions by subsampling reads from the original dataset such that the minimal bin size that guaranteed 1000 reads in at least 80% of the bins [2] was either 10 kb, 50 kb, 100 kb, or 250 kb. Then, we normalized each contact matrix using two popular approaches: the iterative correction and eigenvector decomposition [39] (ICE) and a parametric model of local genomic features [40] (LGF, a.k.a. HiCNorm). In total, we tested 22 callers, each across 8 different conditions (Fig. 1a).
First, we examined the number of TADs and their average size obtained with each TAD caller at each matrix resolution using ICE-normalized data. Not all callers successfully completed their run at all resolutions, with certain callers not performing well either with the smallest bin size (chromoR, HiCExplorer, HiTAD, and MrTADFinder) or with the largest bin size (arrowhead and PSYCHIC) (Fig. 1b, c—gray boxes). Interestingly, for a given bin size, the number of identified TADs could vary of up to two orders of magnitude among callers (Fig. 1b). Consequently, so did vary the average TAD size measured in number of base pairs (Fig. 1c). Similar results were obtained with LGF-normalized data (Additional file 2: Figure S1a). For both normalization procedures and examining TAD callers that returned results at all resolutions (n = 17), we found that the mean TAD size was generally increasing with increasing bin size (Fig. 1d and Additional file 2: Figure S1b). Conversely, the size of TADs measured in number of bins was relatively stable (Fig. 1e and Additional file 2: Figure S1c) for most of the callers, suggesting these callers tend to call TADs with the same number of bins, rather than corresponding to the same genomic region. In particular, the growth trends exhibited by TAD size as a function of bin size were remarkably similar independent of the normalization strategy adopted (Additional file 2: Figure S1d).
The slopes corresponding to the growth of the TAD size, measured by number of either base pairs or bins, based on the bin size adopted confirmed this tendency among all callers (Fig. 1f). Indeed, for almost all callers, we found positive slopes when the TAD size was measured in base pairs, whereas slopes were negative and close to zero when the TAD size was measured in number of bins. Exceptions were the Directionality Index (DI) and MrTADFinder, although TAD sizes estimated by the former did not vary monotonically and the latter performed poorly with bin size equal to 10 kb affecting the slope estimation. Interestingly, tools based on a linear score (Fig. 1f, red dots) had a greater tendency for identifying TADs with the same number of bins, rather than covering the same genomic regions, than tools relying on clustering-based approaches or statistical models (Fig. 1f, yellow and green dots, respectively).
Given different bin sizes corresponded to Hi-C contact matrices with a different number of reads, i.e., different resolutions, we wondered whether TADs observed in these different matrices reflected nested domains, such that the smallest domains are detectable with small bin sizes, whereas only large domains comprising the smallest ones can be detected with large bin sizes. If so, then TAD boundaries detected with large bin sizes must be a subset of the boundaries detected with small bin sizes (Additional file 2: Figure S2a). We tested this hypothesis for all TAD callers and computed the fraction of boundaries detected with a given bin size that contained at least one of the boundaries detected with the bin size immediately smaller (Additional file 2: Figure S2b). The results were overall variable among callers, but for several, we found that at least 50% of the boundaries were retained between resolutions. Notably, although arrowhead, GMAP, and PSYCHIC detect by default nested or overlapping TADs, their rate of boundary conservation was typically below 50% (Additional file 2: Figure S2b).
The presence of nested TADs can therefore explain, at least in part, the highly variable TAD size observed with different bin sizes. Nonetheless, boundary conservation was often low (below 50%) and size variability among callers for a given bin size remained pronounced. Overall, these findings challenge the possibility of defining a “typical TAD size” and indicate that the average size of a TAD is largely dependent of the tool used to identify these domains and resolution of the contact matrix.
To assess the similarity between TADs identified by the same caller using different normalizations or bin sizes, we recurred to the Measure of Concordance (MoC), previously introduced to compare clustering partitions [41]. Briefly, given two sets of TADs, MoC assesses the overlap between each pair of TADs, measured in number of base pairs and considering the overall size of both TADs (Additional file 2: Figure S2c). MoC ranges from 0, complete lack of concordance, to 1, perfect concordance, and it has the desirable property of being symmetric.
First, we compared TADs determined by each caller at a given bin size using ICE or LGF normalization strategy. Overall, TADs were often highly concordant between normalizations (MoC > 0.75) for about half of the callers (Fig. 2a), although results from ClusterTAD, spectral, chromoR, and matryoshka seemed particularly sensitive to the applied normalization strategy. We could not run 3DNetMod on LGF-normalized data, hence could not make the comparison for this caller. The concordance between TADs determined at different bin sizes was lower than it was between normalization strategies in all callers (Fig. 2b). In particular, as the ratio between the compared bin sizes increased, the concordance between the resulting sets of TAD decreased and almost invariably fell below 0.5 for five- to ten-fold bin size ratios (Fig. 2b). MoC values obtained by each caller between normalization strategies correlated with values obtained between different bin sizes (Fig. 2c). Overall, HiTAD, CHDF, CaTCH, TopDom, TADbit, PSYCHIC, and HiCseg demonstrated robust and consistent TAD partitions independent of the normalization and bin size adopted (Fig. 2c, top right corner).
Given many of these approaches rely on different parameters, we wondered how much the choice of these parameters influences the results. We should note that all callers have here been run using recommended parameters or default ones if no recommendations were made (see Additional file 1). For a subset of them, we compared the concordance between TADs identified using recommended or other parameters, but same bin size. Overall, most tools returned concordant TAD partitions (MoC > 0.7, Additional file 3: Table S1), with the exception of the Directionality Index (MoC > 0.6 for two independent comparisons), and Insulation Score (MoC = 0.32 and MoC = 0.4 for two independent comparisons). Both DI and IS have window size parameters that are highly dependent on the bin size adopted and, indeed, in our analyses we tuned these parameters when changing bin size (Additional file 1). These tests seem to indicate that parameter optimization is not a strict requirement for most tools. However, testing the robustness of a caller to its parameters might sometimes be necessary and callers with few or no parameters are preferable in this regard (see Table 1).
The ability of a caller to robustly perform independent of the resolution and quality of the data is highly desirable, as Hi-C experiments with high sequencing depth remain costly and, thus, difficult to scale to large datasets. To specifically assess the performance of all callers when different numbers of reads are available, independent of the bin size of the matrix, we systematically subsampled Hi-C contacts determined for the GM12878 cell line (chromosome 6) and compared the TADs called with subsampled reads with those obtained from the complete contact matrix. For this test, we used ICE-normalized data binned in 50-kb intervals and generated 13 different matrices using different percentages of reads that ranged from 100 to 0.01%, corresponding to an estimated cost ranging from over $1000 to less than a dollar (Fig. 3 top panel). Some of the tested callers failed to return a TAD partition for low sequencing depth, in particular arrowhead and HiTAD did not identify TADs with less than 1% of the reads (Additional file 2: Figure S3). Interestingly, the number of TADs and TAD size determined by each caller across subsampling (Additional file 2: Figure S3) were nonetheless less variable then they were when changing the bin size of the matrix (Fig. 1b, c). In agreement with our results across normalizations and bin sizes, TopDom, TADbit, HiCseg, and CaTCH all demonstrated highly concordant results: TADs generated with these callers using the full set of contacts were highly reproducible using less than 1% of the reads (MoC > 0.75) (Fig. 3). Notably, the best performance was here achieved by the Insulation Score (IS) that obtained MoC values > 0.75 with up to 0.2% of the total number of reads (Fig. 3) in agreement with recent analyses [42]. Using these callers would therefore lead to similar results at a fraction of the cost.
Comparison of TADs identified by different callers
Up to this point, we have evaluated the concordance and robustness of the results obtained by each individual caller using different bin sizes and normalization strategies. Next, we compared the TADs obtained using different callers to assess their concordance. To this purpose, we fixed the bin size to 10 kb, corresponding to the matrices with the highest resolution, and used only ICE-normalized data, given the high concordance of results obtained with two normalization strategies for most callers. Using these settings, we could compare results from 18 callers.
First, we explored how frequently the same TAD or TAD boundary was called by different TAD callers (shared boundaries/TADs), using variable minimum distances between boundaries to determine shared TADs and TAD boundaries. Across all TAD callers, most of the boundaries were detected by less than half of the methods (Fig. 4a), and even less shared were TADs, with the highest fraction of them being detected by less than 4 callers (Fig. 4b). As expected, with increasing minimum distance, the fraction of shared TADs and TAD boundaries increased with distributions centered around 8 and 4 callers, respectively, for a minimum distance of 5 bins (± 50 kb). TADs and TAD boundaries detected by each caller exhibited different extent of agreement. Imposing a minimum distance of 2 bins (± 20 kb), most of the callers found at least 50% of boundaries that were also detected by more than 5 other callers (Fig. 4c) and this percentage was greater than 80% for the Directionality Index approach (DI), GMAP, TADbit, and arrowhead. Conversely, the majority of the boundaries called by matryoshka, armatus, PSYCHIC, spectral, and 3DNetMod were detected by less than 5 callers. It should be noted that, with the exception of PSYCHIC, these callers also identify a large number of boundaries (Fig. 4c). Consistently, DI, GMAP, arrowhead, and TADbit also identify the highest proportion of TADs that were also called by more than 5 other callers, whereas the majority of TADs called by IS and ClusterTAD were never detected by the other callers (Fig. 4d).
Next, we compared results from each pair of callers by Measure of Concordance (MoC). Average pairwise MoC values identify CHDF, TopDom, HiCseg, ICFinder, and CaTCH as the callers with highest mean MoC (MoC > 0.4) (Fig. 4e). Notably, these callers also scored among the most robust across bin sizes and normalization strategies (Fig. 2c–e). Stochastic neighbor embedding analysis (t-SNE [43]) of MoC values obtained by each caller against the others revealed three major groups of callers exhibiting high MoC values within group and low among groups (Fig. 4f). Callers within each group were often based on different algorithmic strategies; nonetheless, the number of TADs found by each caller was similar within group, but significantly different among them (Fig. 4g). Highly correlated pairwise MoC values were detected when comparing TAD partitions determined with bin size of 50 kb (Pearson’s correlation = 0.6, p value = 5E−33). To assess the concordance among TADs called by different methods in an independent manner, we computed the ratio between conserved TADs among each pair of callers and the minimum total number of TADs found by the two callers, i.e., the maximum possible number of TAD that can be conserved between the pair. The resulting mean conservation ratios were correlated with the mean MoC for all callers except for armatus, matryoshka, and arrowhead, which were the top three scoring based on this new analysis (Additional file 2: Figure S4a). Interestingly, the former two found by far the largest number of TADs; thus, even a small fraction of conserved TADs could still represent a relatively large number of domains, whereas arrowhead found the smallest number of TADs, indicating a high fraction of these domains were also called by other methods. Overall, these findings provide a reference to anticipate the concordance of the results based on specific TAD partitions depending on the adopted caller.
TAD enrichment for CTCF and cohesin binding and histone methylation marks
While robustness and concordance are important features to assess the performance of a given algorithmic approach, the quality of its output is ultimately assessed by whether the results recapitulate true features of the system being analyzed. However, here, we lack a “true” set of TADs that could be used as a reference. Furthermore, the highly variable number of TADs and TAD sizes across callers challenge the design of synthetic datasets where a true TAD partition is pre-defined. To overcome these limitations, we evaluated the biological relevance of the TADs identified by each TAD caller, by assessing specific biological features that have been found frequently associated with TADs and/or TAD boundaries.
The chromatin insulator protein CTCF and the cohesin complex have been frequently reported at TAD boundaries, and they seem to be required for boundary formation [2, 6, 44] (Fig. 5a). Thus, we tested whether TAD boundaries identified by each caller were enriched for binding of these proteins. Peaks for CTCF, RAD21, and SMC3 were determined from chromatin immuno-precipitation followed by high-throughput sequencing (ChIP-seq) data and we explored the intensity of these peaks at domain boundaries and flanking regions. For some callers, e.g., TopDom, peaks of CTCF, RAD21, and SMC3 were pronounced at domain boundaries (Fig. 5b), whereas for others, e.g., spectral, peaks of the 3 proteins did not show any association with the TAD boundaries identified by the caller (Fig. 5c). To quantify these trends, we computed the percentage of TAD boundaries that were tagged by either of the 3 proteins (Additional file 2: Figure S4b) and the fold change between peaks found at TAD boundaries and those at adjacent flanking regions (Fig. 5d). Results from these analyses were correlated and consistent for all 3 proteins and identified arrowhead, DI, TopDom, and CHDF among the top 5 callers based on both criteria, with greater than twofold increase of CTCF binding signal at TAD boundaries and ~ 40% or more boundaries tagged by either CTCF or cohesin. Vice versa, matryoshka, ClusterTAD, armatus, and spectral always scored at the bottom with fewer than 20% of the boundaries tagged by either CTCF or cohesin and exhibiting no different ChIP-seq peak intensities between boundaries and flanking DNA regions (Additional file 2: Figure S4b and Fig. 5d). Given the variable level of agreement among TAD boundaries called by different methods, we tested CTCF enrichment at boundaries called by at least 50% of the callers (n = 9) compared to boundaries called by less than half of them. For all methods, we found a dramatic increase of CTCF fold changes for boundaries called by 50% or more callers (Fig. 5e) and this trend increases with the minimum number of callers consistently calling a boundary (Fig. 5f). Overall, shared boundaries always exhibited higher CTCF ChIP-seq signal than adjacent regions and the more callers identified a boundary, the higher the fold change of the signal (Fig. 5f).
Given some TAD callers can identify hierarchy of TADs or overlapping TADs (not necessarily nested—see Table 1), we investigated CTCF binding at TAD boundaries based on their nesting level. For this analysis, we only retained perfectly nested TADs and nesting levels comprising at least 10 TADs. Moreover, we define as the “lowest” nesting level (or L1), the set of TADs that do not contain any other nested TAD. In general, CTCF signal fold changes increased with the level of nesting, especially for tools that at the lowest level identified a very high number of small TADs (armatus, matryoshka, and 3DNetMod) and was positively correlated with the TAD mean size (Pearson’s correlation = 0.37). However, for mean sizes between 250 kb and 1.25 Mb, this correlation was no longer observed (Pearson’s correlation = 0.05) and results were thus independent of the TAD size (Fig. 5g—gray area). With increasing nesting, the number of TADs identified by each caller rapidly decreased. This might be a desirable feature for tools such as armatus and matryoshka that initially identified > 2500 TADs each. For these approaches, higher nesting levels led to a smaller number of TADs better associated with CTCF binding, even though CTCF fold changes remained consistently smaller than those observed for top scoring tools, such as arrowhead, DI, and TopDom. On the other hand, best CTCF fold change for GMAP and CaTCH (both obtained at L2—Fig. 5g) were derived from 87 TADs (mean size = 1 Mb) and 34 TADs (mean size = 550 kb), respectively, suggesting that these are not representative of the entire chromosome and lower nesting levels should also be considered to explore the full organization of the chromatin into TADs.
Histone-3 methylation marks are indicative of transcriptional activity within specific regions of the genome. Interestingly, it has been shown that TADs are frequently enriched for either activating (H3K36me3) or repressing (H3K27me3) marks [2, 7]. Hence, we explored the ratio between H3K27me3 and H3K36me3 within each TAD found by each caller and determined the percentage of TADs that exhibited a significant enrichment for either mark (empirical FDR < 0.1, see the “Methods” section) (Fig. 5h). Arrowhead was the top scoring approach also in this analysis, indicating that domains identified by this caller most frequently exhibit TAD-associated biological features (Fig. 5i). 3DNetMod, GMAP, CHDF, and HiCseg all identified more than 50% of TADs with a significant enrichment for either H3K36me3 or H3K27me3, whereas EAST, ClusterTAD, PSYCHIC, spectral, and TADbit scored at the bottom of this analysis, with less than 10% of the identified domains enriched for either of the histone marks (Fig. 5i). The association between TADs and H3K27me3 and H3K36me3 was largely independent of the nesting levels, for tools that identified nested TADs (Additional file 2: Figure S4c).
Overall, the results from these analyses based on known TAD-associated biological features were correlated, yet several tools scored well in one and poorly in the other (Additional file 2: Figure S4d). Arrowhead always scored at the top, but also GMAP, CHDF, TopDom, DI, HiCseg, and ICFinder exhibited a good enrichment for both CTCF/cohesin binding and histone mark specificity (Additional file 2: Figure S4d) and these results were confirmed even when accounting for nested TAD structures found by specific tools.
Validation on independent chromosomes and datasets
All the analyses presented so far have been run on chromosome 6 of the GM12878 cell line. To test the reproducibility of our results on additional chromosomes, we repeated key analyses on chromosomes 1, 3, 13, and 15, chosen to include both large and short chromosomes with high sequencing depth. Specifically, we correlated the results obtained on chromosome 6 with those derived on the other chromosomes for the following analyses: mean TAD size, TAD concordance between normalizations (ICE vs. LGF), TAD concordance among different bin sizes, TAD concordance among different callers, enrichment of CTCF and cohesin peaks at TAD boundaries, and H3K27me3/H3K36me3 ratios within TADs. Across all analyses, we could confirm highly correlated results (Additional file 2: Figure S5), indicating that these are robust with respect to and independent of the chosen chromosome.
To further corroborate our results, we compared TAD and TAD features derived from chromosome 6 of the GM12878 cell line with four distinct datasets: two independent replicates of the GM12878 model and two independent Hi-C datasets including human fetal fibroblasts (IMR90 cell line) and mouse cortical neurons (MCN) [45]. For these analyses, we selected a subset of 6 callers that exhibited variable performance across our tests: arrowhead, CaTCH, HiCseg, ICFinder, spectral, and TopDom. First, we compared the concordance (MoC) between TADs identified by each caller in the two independent GM12878 replicates. These datasets were separately generated, but derived from the same cell line; hence, a high level of concordance is expected. Indeed, all callers, except spectral, generated TADs with MoC greater than 0.7; for CaTCH, HiCseg, and TopDom, it was greater than 0.8. Next, we used the results derived from the complete GM12878 dataset as a reference and separately compared results from the GM12878 replicates, IMR90, and MCN against this reference. The mean TAD size identified by each caller was highly consistent between each pair of datasets, both using 10-kb and 50-kb bins, except for arrowhead that found bigger TADs in MCN compared to GM12878 (Fig. 6a, b). Nonetheless, the overall set of results were highly correlated, with spectral always identifying the largest number of TADs with smallest mean size, and arrowhead the smallest number with the largest size. Similarly, the concordance between TAD partitions determined by each pair of callers was remarkably consistent between GM12878 and the four other datasets (Fig. 6c). Finally, we compared CTCF binding fold changes and the ratio between H3K27me3 and H3K36me3 within TADs determined in the different experiments. Results for CTCF fold changes were highly correlated (Fig. 6d), whereas this correlation was weaker for the comparisons of histone mark ratios with IMR90 and MCN (Fig. 6e), potentially reflecting distinct specific histone mark changes associated with different cell types and species. Overall, results from independent replicates and datasets were highly concordant with those obtained for the GM12878, indicating that the performance of each method was largely independent of the analyzed dataset.