CRUP: a comprehensive framework to predict condition-specific regulatory units

We present the software Condition-specific Regulatory Units Prediction (CRUP) to infer from epigenetic marks a list of regulatory units consisting of dynamically changing enhancers with their target genes. The workflow consists of a novel pre-trained enhancer predictor that can be reliably applied across cell types and species, solely based on histone modification ChIP-seq data. Enhancers are subsequently assigned to different conditions and correlated with gene expression to derive regulatory units. We thoroughly test and then apply CRUP to a rheumatoid arthritis model, identifying enhancer-gene pairs comprising known disease genes as well as new candidate genes.

Fig S1: Distributions of HM ChIP-seq and ATAC-seq read counts for 51, 453 called enhancer peaks in mESC + . A k-means clustering for k = 2 reveals two groups of enhancers (cluster 1 and 2), which both show an enrichment of the enhancer-typical HMs H3K4me1 and H3K27ac, as well as of an independent ATAC-seq experiment. The two clusters differ in enrichment intensity, possibly showing strong (cluster 1) and weak (cluster 2) enhancers. The promoter mark H3K4me3 is lower than H3K4me1 in both enhancer groups. Decreasing the probability cutoff in a step-wise manner from 0.5 to 1 leads to an increasing number of genome-wide predicted enhancer regions (x-axis) for which we computed the median distance to the closest ATAC-seq peak (y-axis) representing the spatial resolution of the predictions (described in Section 5.9). We used A) the summit of the ATAC-seq peaks and B) the 'NarrowPeak' output of the peak calling method as a baseline to compute the distances. We compare our results (CRUP-EP) with the results of REPTILE in several settings, which are explained in more detail in Section 5.12.  i.e., the probabilities with which each HM is found in each state. According to the combination of H3K27ac and H3K4me1 emission probabilities, we defined the enhancer states for each of the three segmentations. For two, three and four states, we could not clearly distinguish an enhancer state from a promoter state (high H3K27ac and high H3K4me3 emission probabilities). The performance results for the five and eight state ChromHMM model is depicted in Figure 2. In general, the performance for five to eight states is similar, while with increasing number of states the performance increases slightly (results not shown).

H3K4me1
H3K4me3 ratio and HM sample (rows). The Gini index is measured during the training of the individual random forsts: A) classifier 1 (active vs. inactive regions) B) classifier 2 (active enhancers vs. active promoters) C) combined random forest as described in Section 5.9. . We trained classifiers on 12 different samples according to our CRUP-EP framework (dark orange) and using a combined random forest variant (grey, 'single RF') as described in Section 5.9. Shown are the predicted probabilities of the active enhancers contained in the test set of each of the 12 samples.
Fig S10: AUC-ROC performance of CRUP for predictions across cell lines and species.

A)
95 95 95 97 96 95 96 96 95 93 96 97   96 96 96 97 96 95 96 96 95 96 96 97   97 97 97 97 97 96 97 97 97 98 94 96   98 97 97 98 96 96 97 97 97 97 90 93   97 97 97 98 97 97 97 97 97 97 94 96   97 97 97 97 97 97 97 97 97 97 94 96   97 97 96 98 97 97 97 97 97 97 94 96   97 97 97 97 97 97 97 97 97 97 94 95   97 97 97 99 96 97 97 97 98 96 93    Fig S13: AUC-ROC performance of REPTILE for predictions across cell lines and species. Two REPTILE classifiers were trained on and applied to samples from different cell types (hepatocyte, ESC, adipocyte, fibroblast) and species (mouse and human) based on A) FAN-TOM5 derived enhancers and three core mESC HMs and B) FANTOM5 derived enhancers, three core mESC HMs and intensity deviation features (described in Section 5.12). The results can be summarized in 12 × 12 heatmaps where each entry is shaded according to the computed AUC-ROC (in percent). The origin of the training data a)-l) can be found in the rows and the origin of test sets a-l in the columns. We also applied two pre-trained REPTILE classifiers, which arer based on C) three core mESC HMs and intensity deviation and D) three core mESC HMs, intensity deviation and differentially methylated regions, on the 12 test sets described above. The results are summarized in 1 × 12 AUC-ROC heatmaps.  quant_mESC_30  quant_mESC_40  quant_mESC_50  quant_mESC_60  quant_mESC_70  quant_mESC_80  quant_mESC_90  no_quant_mESC_10  no_quant_mESC_20  no_quant_mESC_30  no_quant_mESC_40  no_quant_mESC_50  no_quant_mESC_60  no_quant_mESC_70  no_quant_mESC_80  no_quant_mESC_90 Fig S14: Quantile normalization effects on the predicted probabilities after read depth reduction. Distributions of genome-wide predicted probabilities (in 100 bp bins). We used the CRUP-EP classifier trained on the mESC + sample and applied it to nine mESC + samples with a reduction of reads from 10 to 90 % ('mESC 10' corresponds to an mESC + sample where we the original number of reads is reduced to 10%) A) Predictions with quantile normalization. B) Predictions without quantile normalization. C) Distribution of absolute differences in genome-wide predicted probabilities. We computed the absolute difference between the predicted probabilities (per bin) of the original mESC + sample and each of the samples with reduced depth (i) with quantile normalization (solid line) and without quantile normalization (dotted line). We used the CRUP-EP classifier trained on the mESC + sample and applied it to (i) the mESC + sample (dark orange), (ii) the healthy fibroblast samples #1 and #2 using quantile normalization as described in Section 5.9 (orange and light orange) and to (iii) the healthy fibroblast samples #1 and #2 without applying quantile normalization (blue dotted lines). B) Distribution of absolute differences in genome-wide predicted probabilities. We computed the absolute difference between the predicted probabilities (per bin) of healthy fibroblast sample #1 and #2 in the scenarion (i) with quantile normalization (orange) and (ii) without quantile normalization (blue).  Motif enrichment for differential enhancers associated with retinoic acid signaling. We computed the motif enrichment for the differential enhancers which were grouped in cluster 1 (LIF; active enhancers only in the pluripotent state) and cluster 2 (RA; active enhancers only in RA-induced differentiation state). We further filtered for TFs which had an enrichment value of at least 1 in one of the clusters and in addition a difference in enrichment of at least 1 between both clusters, resulting in 85 TFs. From these, we do not depict 9 TFs which have a very similar motif to OCT4 (POU1F1, POU2F1/2, Pou2f3, POU3F1/2/3/4, POU5F1) and 6 TFs with a similar motif as TFAP2C (TFAP2A var.2/3, TFAP2B var.2/3, TFAP2C var.2/3) to avoid redundancy in the plot. Depicted are enrichment values for the two clusters and additionally, the log 2 -fold enrichment between cluster 1 and cluster 2.  S17: Differential (LIF-dependent) enhancer example. A) An example for a dynamic enhancer region (chr4 : 99640301 − 99643500, highlighted in blue) which was predicted by CRUP to be active in mESC + (LIF dependent) but not in mESC − (RA dependent). B) The predicted differential enhancer sequence was tested using an enhancer reporter assay (STARR-qPCR). The difference in the transcript levels of the GFP reporter between between mESC − (-LIF/+RA) and mESC + (+LIF/-RA) as well as compared to an untreated sample (-LIF/-RA) recapitulate the predicted dynamic activity. The LIF inducible viral enhancer CMV serves as a positive control. As a negative control we chose nc1, which is not active in mouse embryonic stem cells. Differential enhancers were grouped in cluster 1 (Rh. Arth.; active only in diseased samples) and cluster 2 (H; active only in healthy samples). We filtered for TFs with an enrichment value > 1 in one of the clusters and a difference in enrichment > 0.5 between both clusters, resulting in 80 TFs. From these, we do not depict 2 TFs which have a very similar motif to HOXA2 (HOXB2, HOXB3 ) and 3 TFs with a similar motif as HOXA13 (HOXB13, HOXD13, HOXC10 ) to avoid redundancy in the plot. We computed the log 2 -fold values between the enrichment of the two clusters, where red colours indicate a higher motif enrichment in cluster 1, and blue colours a higher enrichment in cluster 2. KLF4 and SPI1 (PU.1) (highlighted in orange) are known to play a role in rheumatoid arthritis, either as regulator of proinflammatory signaling (Luo et al., 2016)  Using 5-fold cross-validation we evaluated the enhancer classifier embedded in CRUP for 1600 different parameter settings in total which were derived from different choices for training seeds, number of neighboring windows (x-axis) and number of decision trees (shown in different colours): 10 · 16 · 10 = 1600. Shown is the area under the precision recall curve (y-axis) for A) classifier 1 (active vs. inactive regions), and B) classifier 2 (active enhancers vs. active promoters). Using 5-fold cross-validation we evaluated the enhancer classifier embedded in CRUP for 1600 different parameter settings in total which were derived from different choices for training seeds, number of neighboring windows (x-axis) and number of decision trees (shown in different colours): 10 · 16 · 10 = 1600. Shown is the area under the ROC curve (y-axis) for A) classifier 1 (active vs. inactive regions), and B) classifier 2 (active enhancers vs. active promoters). Fig S22: AUC-PR cross-validation results for parameter tuning in mESC using the extreme gradient boosting approach. Using 5-fold cross-validation we evaluated the enhancer classifier based on the XGBoost algorithm for 1600 different parameter settings in total which were derived from different choices for training seeds, number of neighboring windows (x-axis) and number of decision trees (shown in different colours): 10 · 16 · 10 = 1600. Shown is the area under the PR curve (y-axis) for A) classifier 1 (active vs. inactive regions), and B) classifier 2 (active enhancers vs. active promoters).

A) B)
Fig S23: AUC-ROC cross-validation results for parameter tuning in mESC using the extreme gradient boosting approach. Using 5-fold cross-validation we evaluated the enhancer classifier based on the XGBoost algorithm for 1600 different parameter settings in total which were derived from different choices for training seeds, number of neighboring windows (x-axis) and number of decision trees (shown in different colours): 10 · 16 · 10 = 1600. Shown is the area under the ROC curve (y-axis) for A) classifier 1 (active vs. inactive regions), and B) classifier 2 (active enhancers vs. active promoters).

A)
•  Fig S24: AUC-ROC and AUC-PR cross-validation results for CRUP-EP and the XG-Boost based approach. Using 5-fold cross-validation we evaluated the random forest based classification approach used in CRUP ('rf') and the classifier based on the XGBoost algorithm ('xgboost') for a parameter choice of 100 decision trees and 5 neighboring bins. Shown is the area under the PR curve over 10 different training seeds (y-axis) and 12 samples (x-axis) for A) classifier 1 (active vs. inactive regions), B) classifier 2 (active enhancers vs. active promoters), as well as the area under the ROC curve for the same setting for C) classifier 1 (active vs. inactive regions), and D) classifier 2 (active enhancers vs. active promoters).