FOCS: a novel method for analyzing enhancer and gene activity patterns infers an extensive enhancer–promoter map

Recent sequencing technologies enable joint quantification of promoters and their enhancer regions, allowing inference of enhancer–promoter links. We show that current enhancer–promoter inference methods produce a high rate of false positive links. We introduce FOCS, a new inference method, and by benchmarking against ChIA-PET, HiChIP, and eQTL data show that it results in lower false discovery rates and at the same time higher inference power. By applying FOCS to 2630 samples taken from ENCODE, Roadmap Epigenomics, FANTOM5, and a new compendium of GRO-seq samples, we provide extensive enhancer–promotor maps (http://acgt.cs.tau.ac.il/focs). We illustrate the usability of our maps for deriving biological hypotheses. Electronic supplementary material The online version of this article (10.1186/s13059-018-1432-2) contains supplementary material, which is available to authorized users.


Supplementary Figure 14. Correlation between promoter DHS signal and gene expression.
We examined the correlation between DHS signal at promoters and gene expression levels using ENCODE cell lines for which both DHS and RNA-seq dataset were available (this included 11 cell-lines with polyA RNA-seq and 6 cell lines with total RNA-seq). In all cases, we observed high Spearman but low Pearson correlation indicating strong monotonic, non-linear relationship.  (*) E-P links whose E is located within an intron of a gene (not necessarily the target gene) (**) Number of Entrez genes associated with promoters ) was 39,892 before model selection (2) The number of OLS promoter models ( ) was 6,807 before model selection (3) The number of OLS promoter models ( ) was 1,951 before model selection (4) The number of OLS promoter models ( ) was 4,851 before model selection (*) Selected promoter models passed either both validation tests or the activity level test only (**) Selected promoter models passed either binary test and/or the activity level test

GRO-seq data preprocessing
We downloaded raw sequence data of 245 GRO-seq samples from the Gene Expression Omnibus (GEO) database (Additional file 3: Table S5). First, we applied read quality control on each profile using the Trimmomatic tool (default parameters) [1]. From each read we trimmed (1) bases from Illumina Tru-seq adapters, and (2) bases with low base quality scores from both ends. We excluded reads with net length <30 bases. Finally, we cropped each read to the first 30 bases from the 5' end. Second, we aligned the trimmed read to a set of known ribosomal RNA (rRNA) genes (FASTA sequences taken from NCBI: RN18S1, RN28S1, RN5, and RN5S17) using bowtie2 [2] (default parameters), and discarded reads aligned to rRNA genes. Third, we aligned the rest of the reads to hg19 reference genome using bowtie2 (default parameters). For subsequent analyses we used only reads that had a MAPQ score greater than 10. Fourth, we merged aligned reads from multiple profiles with the same sample id (via GEO GSM id) into a single sample. In total, our collected GRO-Seq database covered 40 studies encompassing 245 samples from 23 cell lines, each assayed under control and stress conditions (Additional file 3: Table S5).
We quantified gene transcription activity by counting the number of reads mapped within each (unspliced) gene. As gene models we used a single transcript per gene, constructed using groHMM's makeConsensusAnnotations function [3] and hg19 UCSC refGene table, producing 22,891 consensus genes. We only used reads mapped to the gene's transcript body in the range 0.5kb to 20kb downstream of the TSS. If the transcript's length was less than 20kb then we used only the region up to the transcript termination site (TTS).
To identify active enhancers in each sample, we applied dREG [4] on the aligned reads. dREG detects "transcriptional regulation elements" (TREs) based on symmetric forward and reverse read coverage relative to their center position. This symmetry is a known mark of short putative enhancers [5]. We merged overlapping TREs (taking the union of their locations) detected in different samples to create merged TREs (mTREs). We defined as enhancers mTREs that are either: (1) intergenic: mTREs whose center is located at least 5kb from the closest gene's TSS and does not overlap any gene's transcript body, or (2) intronic: mTREs that are not exonic and have overlap with an intron of a gene. We counted the number of reads in each intergenic enhancer (in both strands) and intronic enhancer (only in antisense strand) in each sample using BEDTools [6].
The gene and enhancer expression matrices were further filtered to include only genes/enhancers (rows) with at least one sample (columns) with RPKM 1, in order to preserve only expressed genes/enhancers. Next, to focus of the analysis on differential genes, we calculated for each the coefficient of variation (CoV) (the ratio between the gene's standard deviation to the mean ), and selected the most variable ones as follows: (1) we partitioned the genes according to their mean RPKM expression into 20 bins. (2) In each bin we retained the genes with CoV above the bin's median level. These two steps also reduce preference to highly or lowly expressed genes. The final gene matrix contained 8,360 genes, and the final enhancer matrix contained 255,925 enhancers.
We defined for each gene the set of k=10 candidate enhancers located within a window of ±500Kb from its TSS.

FOCS Model Implementation
The input to FOCS is two activity matrices, one for enhancers ( ) and the other for promoters ( ), measured across the same samples. Activity is measured by DHS signal in ENCODE and Roadmap data, and by expression level in FANTOM5 and GRO-seq data. Samples were labeled with a cell-type label out of cell-types. The output of FOCS is predicted E-P links.
First, FOCS builds for each promoter an OLS regression model based on the k enhancers whose center positions are closest to the promoter's center position (in ENCODE, Roadmap, and FANTOM5) or TSS (in GRO-seq). Formally, let be the promoter normalized activity pattern (measured in CPM -counts per million; is a row from ) and let be the normalized activity matrix of the corresponding k enhancers (CPM; k rows from ). We build an OLS linear regression model , where is a vector that denotes the errors of the model and is the ( 1) 1 vector of coefficients (including the intercept) to be estimated.

Second, FOCS performs leave-cell-type-out cross validation (LCTO CV) by training the promoter model based on samples from
1 cell types and testing the predicted promoter activity of the samples from the left out cell type. This step is repeated times. The result is a vector of predicted activity values for all samples.
FOCS tests the predicted activity values using two validation tests: (1) The binary test. This test examines whether discriminates between the samples in which was active (observed activity 1 RPKM) and the samples in which was inactive ( 1 RPKM). (2) The activity level test. This test calculates, for the active samples, the significance of the Spearman correlation between and . Spearman correlation compares the ranks of the original and predicted activities. We obtain two vectors of p-values, one for each test, of length (the number of promoter models).
Third, to correct for multiple testing, FOCS applies on each p-value vector the Benjamini -Yekutieli (BY) FDR procedure [7]. Promoter models with q-value≤ 1 in either both tests or in the activity level test were included in further analyses. In GRO-seq analysis, we also included models that passed only the binary test (m=2,580) since 57% of them had (Fig. S6B). For promoters that passed these CV tests final models are trained again using all samples. FOCS next selects informative enhancers for each final promoter model. First, to control the FDR due to multiple hypotheses we used the BY correction. We call this process enhancer BY FDR filtering (eBY). The OLS results provide for each model P-values for the coefficients of its 10 closest enhancers. FOCS applies BY correction on the P-values produced by all models together and selects enhancers with q-value ≤ 1. To identify the most important ones out of the selected (≤ 1 ) enhancers for each promoter model, FOCS applies elastic-net model shrinkage (enet) with a regularization parameter , using the glmnet R function [8] with mixing parameter =0.5, giving equal weights for Lasso and Ridge regularizations. We require that all the enhancers that survived eBY filtering will be included in the shrunken model. To achieve this we take the maximum satisfying this property. For models in which no enhancer survived the eBY filtering, we took the maximum yielding a shrunken model with at least one enhancer. This ensures that every promoter that passes the CV tests also has a model following the enet step.

Alternative regression methods
We compared the performance of OLS method with GLM.NB and ZINB regression methods. We repeated the FOCS steps but in the first step, instead of OLS we applied the GLM.NB or the ZINB methods. In GLM.NB/ZINB we used for and the raw count values instead of CPM. To correct the model according to differences in samples library sizes, we provided these sizes as an offset vector to GLM.NB and ZINB methods.
FANTOM5 E-P linking using OLS regression was followed by Lasso shrinkage (defined as OLS-LASSO) as described in [9]. Briefly, promoter models were created using OLS and models with were accepted for further analyses. Next, penalized Lasso regression was used to reduce the number of enhancers in the models. Optimal models were selected using 100-fold cross validation and the largest value of lambda such that the mean square error was within one standard error of the minimum, using the cv.glmnet() function in R glmnet package [8]. OLS followed by enet (called OLS-enet) was run with mixing parameter = 0.5 in the cv.glmnet() function. OLS followed by LASSO (OLS-LASSO) was run with 1

GO enrichment analysis
GO enrichments were calculated using topGO R package [10] (algorithm="classic", statistic="fisher", minimum GO set size=10). We split the genes into target and background sets using their enhancer bin sets. Genes belonging to bins with 1-3/1-4/4-10/5-10 enhancers were considered as target set and compared to all genes from all bins as background set. Correction for multiple testing was performed using BH procedure [11].
We downloaded 922,997 ChIA-PET interactions (assayed with RNAPІІ, on four cell lines: MCF7, HCT-116, K562 and HelaS3) from the chromatin-chromatin spatial interaction (CCSI) database [12] (GEO accession numbers of the ChIA-PET samples are provided in supplementary  table S6). We used the liftOver tool (from Kent utils package provided by UCSC) to transform the genomic coordinates of the interactions from hg38 to hg19. HiChIP interactions mediated by YY1 TF (cell types: HCT116, Jurkat, and K562) were taken from [13] (GEO accession id: GSE99521). As done in [13], we retained 911,190 YY1-HiChIP interactions with origami probability>0.9. Origami is a method that aims to find high confident interactions. For eQTL SNPs, we used the significant SNP-gene pairs from GTEx analysis V6 and V6p builds. 2,283,827 unique eQTL SNPs covering 44 different tissues were downloaded from GTEx portal [14].
We used 1Kbp intervals (±500 bp upstream/downstream) for the promoters (relative to the center position in ENCODE/Roadmap/FNATOM5 or to the TSS position in GRO-seq) and the enhancers (±500 bp from the enhancer center). An E-P pair is considered supported by a particular capture interaction if both the promoter and enhancer intervals overlap different anchors of an interaction. An E-P pair is considered supported by eQTL SNP if the SNP is located within the enhancer's interval and is associated with the expression of the promoter's gene. For each predicted E-P pair we checked if the promoter and enhancer intervals are supported by capture interactions and eQTL data. We then measured the fraction of E-P pairs supported by these data resources.
To get an empirical P-value for the significance of the fraction, we performed 100 permutations on the data (100 permutations were sufficient as in all methods we got empirical P-value<0.01). In each permutation, for each promoter independently, if it had E-P links, then enhancers on the same chromosome with similar distances from the gene's TSS as the linked enhancers were selected randomly. For this purpose we used the R 'Matching' package [15]. The fraction of overlap with the external data was computed on each permuted data.