Skip to main content

DNA-binding factor footprints and enhancer RNAs identify functional non-coding genetic variants

Abstract

Background

Genome-wide association studies (GWAS) have revealed a multitude of candidate genetic variants affecting the risk of developing complex traits and diseases. However, the highlighted regions are typically in the non-coding genome, and uncovering the functional causative single nucleotide variants (SNVs) is challenging. Prioritization of variants is commonly based on genomic annotation with markers of active regulatory elements, but current approaches still poorly predict functional variants. To address this, we systematically analyze six markers of active regulatory elements for their ability to identify functional variants.

Results

We benchmark against molecular quantitative trait loci (molQTL) from assays of regulatory element activity that identify allelic effects on DNA-binding factor occupancy, reporter assay expression, and chromatin accessibility. We identify the combination of DNase footprints and divergent enhancer RNA (eRNA) as markers for functional variants. This signature provides high precision, but with a trade-off of low recall, thus substantially reducing candidate variant sets to prioritize variants for functional validation. We present this as a framework called FINDER—Functional SNV IdeNtification using DNase footprints and eRNA.

Conclusions

We demonstrate the utility to prioritize variants using leukocyte count trait and analyze variants in linkage disequilibrium with a lead variant to predict a functional variant in asthma. Our findings have implications for prioritizing variants from GWAS, in development of predictive scoring algorithms, and for functionally informed fine mapping approaches.

Background

Only a very small (2%) proportion of the human genome codes for protein-coding genes. Yet, genetic variants in the non-coding genome are known to contribute to Mendelian diseases and complex traits. Genome-wide association studies (GWAS) have uncovered thousands of genetic variants that are associated with traits or alter disease risk, and these loci predominantly lie in the non-coding genome [1] Identifying functional and causal non-coding variants from GWAS, and the target genes they regulate, remains a major challenge. The model for effects on trait proposes that non-coding variants alter the targets of sequence-specific nucleic acid binding proteins in cis-regulatory elements (CREs) resulting in quantitative changes in gene expression. This could be at the level of modulating RNA splicing by RNA binding proteins [2], or through altering the affinity for DNA binding proteins such as transcription factors (TF) at enhancers [3, 4]. Rare, pathogenic mutations in enhancers have been shown to reduce [5] or increase [6] TF binding. Contrasting the large functional affects for rare alleles, common single nucleotide variants (SNVs) tagged in GWAS are thought to subtly modulate TF occupancy at CREs [4, 7, 8].

Because GWAS identifies regions that contain multiple SNVs in linkage disequilibrium (LD), identifying the functionally relevant SNV(s) poses a significant challenge. One route is to analyze chromatin structure at active CREs. Binding of TFs at enhancers leads to nucleosome displacement, and recruitment of co-activators that can then further modify chromatin including, but not limited to, acetylation of histone H3 lysine 27 (H3K27ac) [9]. The Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) is widely used for profiling sites with disrupted nucleosome structure and has been used in efforts to prioritize risk variants [10, 11]. Prior to the widespread use of ATAC-seq, DNase hypersensitive site-seq (DHS-seq) and DNase I footprinting have been used to identify sites of likely TF-binding and nucleosome disruption and to assess the effect of genetic variants on TF binding [4, 12, 13].

Enhancers and their target genes are most often located in the same topologically associating domain (TAD). Therefore, the integration of GWAS with data on tissue-specific features of active chromatin (ATAC-seq, DHS-seq, DNase footprinting, H3K27ac ChIP-seq) and chromosome conformation capture has been used to try and predict functional variants and their target genes [14, 15, 16]. However, predictions based on active regulatory marks often fall short in pinpointing functional variants [17, 18].

Computational models have also been developed to predict functional non-coding variants using machine learning methods including deep neural networks or linear regression models based on feature annotations, such as open chromatin, histone marks, TF binding, and conservation scores [19]. However, while showing some efficacy for rare variants, they poorly predict variant effects for common non-coding variants [19, 20]. Prioritizing and identifying functional non-coding variants therefore remain a barrier to GWAS follow-up studies.

Another feature of active enhancers is the production of short, unstable bidirectional enhancer RNAs (eRNAs) [21]. To the best of our knowledge, this feature has not been used in the prediction of functional genetic variants. Here, we analyze the utility of active regulatory element marks—open chromatin, H3K27ac, DNase footprints, TF binding, as well as the presence of divergent eRNA, for the prioritization and discovery of functional variants. We analyzed active regulatory markers using > 50,000 datasets from > 2000 cell types, tissue types, and treatment conditions and benchmarked these features with variants that demonstrate enhancer modulating activity using cis-acting molecular quantitative trait loci (molQTL). We find that the combination of DNase footprints and eRNA achieves high precision in predicting functional non-coding variants, and we apply this approach to data from GWAS traits. In this way, we demonstrate a framework to allow prioritization of candidate variants from GWAS data to aid discovery of functional and causative genetic variants.

Results

A compendium of genomic markers of active regulatory regions

Functional non-coding genetic variants are thought to reside in CREs. Based on this, we collated a compendium of published datasets of markers for active CREs from human cells and tissues. We included markers for open chromatin (DHS from DNase-seq and Assay for Transposase-Accessible Chromatin (ATAC-seq)), H3K27ac, divergent enhancer RNA (eRNA), DNase footprints, and DNA-binding factor occupancy from chromatin immunoprecipitation (ChIP). eRNA data come from multiple assays including Cap Analysis Gene Expression (CAGE), Global run-on (GRO)-seq, Precision nuclear run-on (PRO)-seq, and capped small RNA-seq [22]. Data were collected from uniformly processed resources, which included multiple replicates, across cell lines, primary cells, or tissue biosources. In aggregate, this comprised > 50,000 datasets from > 2000 biosources (Fig. 1A, Additional file 1: Table S1). Most datasets are from ChIP-seq, including ChIP for 956 DNA-binding factors and for some factors across multiple cell types or treatment conditions. Only one cell-type, hepatocytes, shares data across all assay types (Additional file 1: Fig. S1).

Fig. 1
figure 1

A compendium of markers of active regulatory regions. A Heatmap for six datasets that mark active regulatory elements. H3K27ac data were obtained from two different sources. Footprints are derived from DNase-seq data. The number of biosources (cell lines, primary cells, or tissues) are indicated. For ChIP-seq datasets, this consists of 956 DNA-binding factors. B Boxplot for each feature length from each biosource with the number shown indicating the median length in base pairs (bp). N, total number of elements for each feature, summed for all biosources regardless of overlapping features. C Genomic annotations for markers of active regulatory markers from merged biosources. TTS, transcription termination site; TSS, transcription start site. D Pairwise Jaccard index for merged intervals from DNase-seq, DNase footprints, ATAC-seq, ChIP-seq, eRNA, and H3K27ac

Taking the aggregate across all biosources for each marker, the number of elements ranges from 1.3 million (eRNA) to 93.3 million (ATAC-seq), with the median length of the elements dependent on the assay (Fig. 1B). The majority of active CREs are found in the non-coding regions—intergenic regions, promoters, or introns (Fig. 1C). We computed the pairwise Jaccard index for markers, which shows the highest similarity is between ATAC and H3K27ac (Fig. 1D).

For each marker, we merged overlapping datasets across all biosources to determine the non-redundant genome coverage of active CREs. The largest coverage is represented by ATAC-seq (75.2%), followed by H3K27ac (61.5%). This is likely due to the high number of biosources (Fig. 1A), the longer median length and distribution (Fig. 1B), but also probably indicates too low a threshold used in peak calling. The smallest coverage is from eRNA (2.1%) followed by DNase footprints (3.6%) (Additional file 1: Fig. S1B).

Prioritization of genetic variants by combinations of regulatory markers

Functional non-coding genetic variants are likely to impact target gene expression through altering CRE activity. To assess the utility of markers of active CREs in prioritizing functional non-coding variants, we established a set of variants which colocalize with molecular quantitative trait loci (molQTL) to benchmark against. While genetic variants that impart phenotypic variation have numerous terms dependent on the molecular trait, we use the unifying term QTL here to include molQTL detected by regression, but also variants with allelic imbalance determined by other statistical methods. We selected three molQTL assays which measure different molecular mechanisms and which demonstrate variant effects in cis that impact CRE activity: DNA-binding factor QTL (bQTL) [23], massively parallel reporter assay (MPRA) QTL (raQTL) [24], and chromatin accessibility QTL (caQTL)—a merge of two sources [25, 26] and [13, 27]—also called chromatin-altering variants (CAV) (Fig. 2A). Variants identified by these molQTL assays are derived from > 16 K datasets across > 800 biosources (Fig. 2B, Additional file 1: Table S3). Most of these datasets are from bQTL assays based on ChIP-seq annotations, while caQTL are based on DNase-seq and ATAC-seq datasets. raQTL variants were obtained from one resource which identified raQTLs in K562 and HepG2 cell lines, using an unbiased set of 5.9 million variants [24]. We have not included other studies using MPRA to identify raQTL as these often use variants identified from specific GWAS traits which would bias our downstream analysis. There was minimal overlap between variants colocalizing with bQTL, raQTL, and caQTL (Fig. 2C). This maintains each molQTL assay as a largely independent set of variants.

Fig. 2
figure 2

Molecular quantitative trait loci (molQTL) with effects on regulatory element activity. A MolQTL with effects on regulatory element activity include binding QTLs (bQTL), reporter assay QTLs (raQTL) from massively parallel reporter assays (MPRA), and chromatin accessibility QTLs (caQTL). The model depicts binding of a DNA-binding factor, which is altered by a single nucleotide variation, thus reducing binding, reporter assay expression, and chromatin accessibility. B Heatmap showing the molQTL datasets and biosources. bQTL data are derived from ChIP-seq experiments of 1073 DNA-binding factors. C Proportional Venn diagram of the number of QTLs from each assay showing minimal overlap

Using each molQTL to benchmark, we aimed to identify functional variants from the GWAS catalog [28, 29], which is derived from > 28 K studies and > 5 K traits or diseases, and containing variant-trait associations with p value of ≤ 1.0 × 10−5 (Additional file 1: Table S3). This set contains > 180 K non-redundant variants, the majority of which are in non-coding regions (Additional file 1: Fig. S2A). To identify how active regulatory marks can functionally annotate non-coding variants, we merged all biosources for each mark, and in the case of ChIP this included merging all DNA-binding factors (Additional file 1: Fig. S2B). This overcomes two main issues. Firstly, that the cell- or tissue-type in which a variant is functional is not known a priori. Secondly, the low overlap of biosources for each marker (Fig. 1B) would limit intersection analysis. We therefore took a cell/tissue agnostic approach by merging biosources. With this, we determined the co-localization of the GWAS catalog variants with each marker of active CREs, and all combinations of markers, totaling 55 combinations from two-way up to six-way combinations.

We determined the precision (positive predictive value (PPV) − [true positive/(true + false positives)]) for each marker, and their combinations, by the intersection with variants from the GWAS catalog, benchmarked against the intersection of the marker combinations with each molQTL as a true positive set. The precision value was dependent on the size of the molQTL set relative to the larger GWAS catalog. The range of precision values was 7.5–55.46% for bQTL, raQTL 0.42–3.53%, and caQTL 4.02–27.7%. In all cases, the highest precision was achieved by the combination of all six markers. To compare precision values between the molQTL, we converted precision values to Z-scores for each molQTL, giving a range of scores from − 2.2 to 1.44. Using unsupervised clustering, we generated a Z-score heatmap of precision values for feature combinations against molQTL assays, showing four clusters (Fig. 3A). The cluster with the highest Z-scores (cluster 3) showed that two markers, DNase footprints and eRNA, were present for all feature combinations within the cluster, including the combination of these two markers alone. In contrast, the cluster with the lowest Z-score (cluster 1) excluded any combination of markers that had footprints or eRNA. As independent markers, DNase footprint or eRNA were present in an intermediate cluster (cluster 2). This suggests that the combination of DNase footprints and eRNA, with or without other markers, provides high precision for identifying functional variants. Increasing the number of features increased the number of predicted negative variants, although the number of predicted negative variants is > 90% for footprints or eRNA alone (Additional file 1: Fig. S3A).

Fig. 3
figure 3

Identifying functional variants using combinations of markers of active regulatory elements. A Heatmap of precision Z-scores for individual, or combinations of, markers for GWAS catalog variants. Variants from the GWAS catalog were intersected with each feature or combinations of features, benchmarked against molQTLs. Unsupervised clustering was performed and percentages of each feature represented in each cluster are shown below. B Fold change of precision scores for functional variant discovery by feature combinations from random genomic SNVs. Common (MAF ≥ 1%) SNVs from the whole genome were subsetted to a comparable number to the GWAS catalog. Precision scores for variants that intersect with feature(s), and co-localize with a molQTL, are expressed as fold change relative to the number of subsetted variants that co-localize with molQTL. Feature combinations are sorted from lowest to highest fold change. One hundred iterations of subsetting from dbSNP were performed and error bars shows standard deviation. C SHAP (SHapley Additive exPlanations) values for a machine learning model using a random forest classifier to identify feature importance for each active regulatory mark towards the functional variant predictive model

While all six features are markers of active CREs, we considered if GWAS variants that intersect these features are associated with functional enhancers. Using a validated set of functional enhancers from CRISPR deletion and CRISPR interference experiments, and non-enhancers from MPRA experiments [22], we overlapped GWAS variants with feature(s). We find that GWAS variants in footprints and eRNA independently are associated with functional enhancers compared to other features alone (Additional file 1: Fig. S3B). However, GWAS variants that coincide with footprints and eRNA together have the highest overlap with functional enhancers.

We next considered the precision of marker combinations to identify functional variants across the entire genome, without association with GWAS traits. Taking all common genetic variants (minor allele frequency (MAF) ≥ 1%), consisting of > 660 M non-redundant SNVs, we randomly selected subsets to obtain a comparable number to the GWAS catalog of ~ 180 K variants. We calculated the precision of combinations of markers and then converted this to a fold change over random SNVs. We repeated the random SNV subsetting 100 times to generate a mean fold change and plotted the markers and combinations from lowest to highest fold change (Fig. 3B). This showed that ATAC-seq, H3K27ac, and the combination of the two are poor discriminators of functional variants (mean fold change 1.12–1.44). In comparison, a step change is observed for the combination of DNase footprints and eRNA, with a mean fold change of 13.89 over random SNVs. Footprints and eRNA alone show a mean fold change of 7.37 and 8.09, respectively, suggesting that they can individually provide functional annotation of variants across the genome, but with higher precision when combined. This is in keeping with the precision observed from footprints with eRNA to identify functional variants in the GWAS catalog. Our analyses suggest that the combination of DNase footprints and eRNA are strong predictors of functional variants with high precision, with additional markers providing only small increments in precision. Importantly, when considering the precision of GWAS compared to random genomic variants, the GWAS catalog shows a mean fold change of 8.24 which, perhaps expectedly, demonstrates that GWAS can identify functional variants, above a random model.

To corroborate our findings, we employed an independent machine learning approach. We trained a random forest classifier model to predict any molQTL signature (bQTL/raQTL/caQTL) based on overlaps with the six active regulatory markers, providing a predictive value of 0.64. We determined SHAP values as a measure of feature importance for the model, which revealed that DNase footprints and eRNA are the top positively associated features for the predictive model (Fig. 3C).

Effect of thresholding of feature calls on precision scores

The local genomic enrichment of assay signals is often translated to interval regions (peaks) through different methods and at varying significance thresholds. These statistical variations in calling enriched regions, for example, false discovery rates (FDR), may impact precision scores for functional variants. For DNase footprints, we used consensus footprints determined by a posterior probability > 0.99 across biosources [13, 27]. We compared this with FDR for footprints from individual biosources and determined the precision scores associated with increasing FDR stringency. For footprints, we find that increasing stringency does increase precision, although the consensus footprint calling performs well (Additional file 1: Fig. S3C). For the remainder of our analyses, we restrict to the consensus set of footprints as these are derived across biosources and are less biased towards individual datasets or cell/tissue types. As ATAC-seq performs relatively poorly with regard to predicting functional variants, peak calling thresholds may impact precision scores. We used an independently analyzed source (ATACdb) [30, 31], which has a smaller number of datasets (~ 1400) compared to the ~ 3700 from GTRD [32, 33]. We further threshold these datasets by p value to increase stringency over local background. We find ATACdb to have a high precision score, but increasing p value thresholds does not improve precision scores. We have taken a third source of ATAC-seq data from ENCODE, using narrow peaks to define local regions of high amplitude and consider this dataset with and without irreproducible discovery rate (IDR) thresholding. We find the ENCODE narrow peaks precision scores improve slightly with IDR thresholding (Additional file 1: Fig. S3). Data processing and significance stringency can therefore impact the precision scores for identifying functional variants. However, this analysis shows that the relatively low predictive value of ATAC-seq peaks is not simply due to insufficiently stringent thresholding.

Centrality in open chromatin to predict functional non-coding variants

TF binding can initiate and maintain chromatin accessibility [34] and indeed summits of DHS have been shown to be enriched for DNA-binding factor motifs [13]. Using a consensus reference map of DHS with summit annotations [35, 36], we tested if centrality in DHS could predict functional variants. We compared core DHS regions, which comprises a consensus centroid region aggregated across biosamples, and distances from the summit at 25 bp, 50 bp, and 100 bp up- and downstream, with a median core length of 37 bp (Additional file 1: Fig. S4A, B). By using variants from the intersection with the GWAS catalog benchmarked against molQTL variants, we calculated precision scores as fold change above GWAS catalog alone. DHS cores have ~ 1.5-fold precision, and this is not increased by extending to 100 bp on either side of the summit (Fig. 4A). Although many footprints fall outside the DHS core, DNase footprints are found at a higher frequency nearer DHS summits (Fig. 4B). Together, these data suggest a mechanism for functional variants related to the enrichment of DNA-binding motifs near DHS summits. When comparing centrality as a means to identify functional variants with DNase footprints and eRNA, the precision is even higher with the combined DNase footprints/eRNA markers (Fig. 4A).

Fig. 4
figure 4

Centrality in the identification of functional variations, and benchmarking against non-coding Mendelian disease-associated variants. A Precision score of DHS cores and DHS central lengths around summits, compared to DNase footprints with eRNA. Precision scores are expressed as fold change of precision score for the indicated feature over the precision score for GWAS variants alone. Scores are benchmarked for each molQTL. B Distribution of nearest DHS footprint up- and downstream relative to DHS summits. C Genomic annotation of validated, rare, non-coding Mendelian disease-associated variants. D Precision-recall graph determined from 230 non-coding Mendelian disease-associated variants, with spike-in of random genomic variants from dbSNP (~ 4 K) to produce a validated variant probability of ~ 0.05. Features were intersected to determine precision (positive predictive value—PPV) and recall (sensitivity). Binary outcomes were determined from feature intersections

Replication with functional, rare, non-coding Mendelian disease-associated variants

To validate our observation of high precision using the combination of DNase footprints and eRNA, we sought to replicate our analysis using a set of validated functional non-coding variants. We used a published set of manually curated rare non-coding variants associated with Mendelian diseases consisting of 210 non-redundant variants in promoter, intergenic, intronic, or 3′UTR regions [37] (Fig. 4C). Given the small number of variants within this validated set, we limited our precision calculations to single markers, or the combination of DNase footprints and eRNA only. We determined precision by spike in of ~ 4 K randomly selected variants across the whole genome, giving a true positive rate of approximately random chance (~ 0.05). We generated precision and recall scores based on the mean from 10 permutations of random genomic variant spike-ins (Fig. 4D). The combination of DNase footprints and eRNA shows the highest precision at 0.65, but low recall (0.25). In contrast, ATAC-seq and H3K27ac show low precision but high recall. DNase footprints and eRNA each alone have intermediate precision and recall. These observations are consistent with our analysis of the GWAS catalog and genomic variants, suggesting the utility of DHS footprints and eRNA in identifying both rare and common functional non-coding variants.

Prioritizing variants from genome-wide association studies

To demonstrate the utility of DNase footprints and eRNA in prioritizing non-coding variants, we analyze 53 traits for intersections with footprints, eRNA, or both (Table 1). Across these traits, the majority of associated variants are, as expected, in the non-coding genome, and this is maintained when using the combined footprint and eRNA marks to prioritize variants (Fig. 5A). Using the leukocyte count trait from the GWAS catalog to exemplify our prioritization approach, ~ 10 K non-redundant variants with significance of < 9 × 10−6 from 51 studies are reduced ~ 30-fold by requiring co-localization with DNase footprints and eRNA (Table 1). Using gene ontology (GO) analysis of neighboring genes, we applied GREAT analysis [38] to all variants from the trait compared to just the variants within footprints and eRNA. GO terms are preserved between all variants and those in footprints and eRNA (Fig. 5B), with enriched GO terms in keeping with leukocyte function. For the most significant GO term, regulation of immune response, we observe preservation of 55% of genes in the footprint and eRNA variant set, compared with all variants (Fig. 5C) suggesting that, despite a ~ 30-fold reduction in number of candidate variants, the enriched genes and loci are largely preserved.

Table 1 Sources for each type of dataset used. For more detail please see Additional file: Table S1
Fig. 5
figure 5

Prioritization of variants using DNase footprints and eRNA preserve cell-specific features and functions. A Genomic annotation of all variants for 53 traits from the GWAS catalog, and annotations when intersection with DNase footprints, eRNA, or both in combination. B Gene ontology (GO) analysis of proximity genes relative to variants of lymphocyte count trait, for all variants for the trait from the GWAS catalog, variants in DNase footprints, eRNA, or both. Significance of the GO term is expressed as − log10 of the FDR q value (FDRq). Top ten GO terms for all variants are shown with FDRq for each intersection. C Number of genes for the GO term “regulation of immune response” comparing number of genes from all variants compared to number of genes from variants in footprints and eRNA. D Enrichment analysis, using FORGE2, of cell-type specific active CREs for the lymphocyte count trait, comparing variants in DNase footprints and eRNA, compared to an equal number (333) of randomly selected variants from all lymphocyte count variants from the GWAS catalog. The random selection was repeated ten times, and the mean p value (− log10) is shown. E An example lymphocyte count-associated variant (rs12722502) at the IL2RA locus which co-localizes with a DNase footprint and eRNA. UCSC genome browser shot shows merged features and ENCODE DNase-seq bigwig tracks for each cell-/tissue-type. F Predicted impact of rs12722502 on DNA-binding factors. Scatter plot of the major and minor allele binding score for DNA-binding factor motifs showing predicted gain or loss of binding. Example DNA-binding factors are labeled

Non-coding variants function in a cell-type specific manner (e.g., [46, 47]) but uncovering the affected cell-type from GWAS can be challenging. By leveraging DNase footprints and eRNA to identify functional variants with high precision, we can then infer the relevant cell-type from this subset of variants. As both footprints and eRNA are associated with open chromatin [25], we took the leukocyte count associated variants co-localizing with footprints and eRNA and identified cell-type specific enrichment of DHS using FORGE2 [48]. Since footprints are derived from DNase-seq, the subset of variants that colocalize with DNase footprints and eRNA show a greater enrichment in DHS compared to an equal number of randomly selected GWAS variants for leukocyte count. Importantly however, despite our cell- and tissue-type agnostic approach of merging features, colocalized variants show enrichment in hematopoietic or immune cell types (Fig. 5D), demonstrating preserved ability to identify trait relevant cell-types despite ~ 30-fold reduction in variant number. Highlighting one variant identified from DNase footprints and eRNA, rs12722502 is an IL2RA intronic variant and is in a myeloid and lymphoid specific DHS (Fig. 5E). This variant is predicted to cause a loss of binding of ETS family TFs (Fig. 5F), members of which are known to regulate immune cell function [49]. Indeed, rs12722502 is associated with fivefold reduction in binding of ELF1 (a TF in the ETS family), in a human leukemic T-cell line [50].

Identifying functional variants in linkage disequilibrium with lead variants

Uncovering functional or causative variants from GWAS is also challenged by LD, whereby the causative variant may be a variant in LD with the GWAS lead variant. We analyzed the utility of the six active regulatory features, and the combination of footprints and eRNA, to identify functional variants in LD with GWAS variants. Using GWAS variants from 53 traits, and variants in LD (R2 > 0.7) with each GWAS variant, we curated ~ 1.5 M variants across traits, which together consisted of ~ 800 K non-redundant variants. Taking GWAS and LD variants for each trait, we calculated precision and recall scores, by benchmarking against the combined set of variants in bQTL, raQTL, and caQTL. We determined mean precision and recall scores for the regulatory feature(s) (Fig. 6A, B) and observe the DNase footprint and eRNA combination to reproducibly have the highest precision, but low recall. While including variants in LD is important, this increases the variant set by a mean of ~ 27-fold across traits. However, considering only LD variants intersecting footprints and eRNA reduces the variant set by a mean of ~ 80-fold (Additional file 1: Table S2).

Fig. 6
figure 6

DNase footprints and eRNA can prioritize functional variants for variants in LD with a lead variant. A Using GWAS variants from 53 traits, mean precision and recall scores were determined by intersecting indicated feature(s) with GWAS and LD variants, against a combined set of QTL variants from bQTL, raQTL, and caQTL. LD variants were determined for each GWAS variant using an R2 > 0.7 For ATAC, intervals were obtained from GTRD, or ENCODE with IDR thresholding. Lines across the mean show the 95% confidence interval of the precision or recall score. B Precision and recall scores for GWAS and LD variants for the leukocyte count trait, asthma, and type 2 diabetes (T2DM). C Manhattan plot of the asthma-associated rs72823641 variant (red), with variants in LD ≥ 0.7 indicated in orange. Merged feature tracks are shown for the locus. Genome co-ordinates (Mb) are indicated. D rs10173081 overlaps a DNase footprint and eRNA. UCSC genome browser shot for the intronic region of IL1RL1 with rs10173081 overlapping a myeloid specific DHS. E rs10173081 is an eQTL for multiple genes in lung and whole blood. GTex violin plots for normalize expression of the indicated genes across homozygous major allele, heterozygous major and minor allele, and homozygous minor allele. F Predicted impact of rs10173081 on DNA-binding factors. Scatter plot of the major and minor allele binding score for DNA-binding factor motifs showing predicted gain or loss of binding. Example DNA-binding factors are labeled

To demonstrate the validity of colocalization with DNase footprints and eRNA to prioritize variants in LD with lead variants, we selected the asthma trait, and considered strong lead GWAS variants which were not associated with the HLA locus, to avoid cell-type bias. We identified rs72823641, which has been found to be associated with asthma in multiple GWAS [51], with p values ranging from 4 × 10−12 to 2 × 10−136, and mapping to a large intron of IL1RL1. We found 92 variants in LD with rs72823641 (R2 > 0.7) using European ancestry data (Fig. 6C). Intersecting these with DNase footprints and eRNA reduced the variants to one candidate—rs10173081 (R2 = 0.96), which resides in myeloid specific open chromatin (Fig. 6D). While rs10173081 is not tagged in the GWAS catalog, it has been observed in other GWAS for asthma [52]. rs10173081 is also an eQTL for IL1RL1 and IL18R1 in lung, but also for IL18RAP and a non-coding RNA AC007278.3 in whole blood (Fig. 6E). Applying TF motif based binding prediction, we identified putative loss of binding for TFs including TCF21, TCF12, TCF4, and KLF3 (Fig. 6F). Interestingly, TCF21 and KLF3 loci have been previously identified as asthma risk-associated GWAS loci [53]. In conclusion, by using combined DNase footprint and eRNA markers, plausible functional variants can be prioritized from variants sets in LD with lead variants.

Discussion

GWAS have been instrumental in uncovering complex trait- and disease-associated variants and regulatory pathways. However, the identification of functional and causative variants remains a significant challenge, impacting the mechanistic understanding of complex disease genetics and the identification of disease/trait relevant pharmacological targets. Consequently, few GWAS have been translated into therapeutic interventions [54]. Discovering functional variants in the non-coding genome constrained by methodological limitations. Computational approaches have been developed that utilize variable genomic features to aid variant prioritization, including functional annotations, conservation scores, and TF motifs [19, 55]. However, for non-coding variants, these methods tend to have low precision for identifying functional variants [19].

Here, we systematically analyzed six assays for active CREs, benchmarking against independent molQTLs for enhancer activity. We find that ATAC peaks and H3K27ac are poor predictors. This is in keeping with previous studies that have shown that deletion of elements harboring candidate variants marked by ATAC peaks or H3K27ac is unable to detect functional impact [17, 18, 56]. We hypothesize that the poor precision, but high recall, of ATAC-seq and H3K27ac peaks relates to their high feature size and high genomic coverage. In contrast, we identify a signature that performs well with respect to precision in identifying genetic variants with putative function. DNase footprints and eRNA have independently been suggested to be enriched for putative functional variants [13, 22]. Individually DNase footprints and eRNA both show good precision; however, we demonstrate that their combination provides superior discrimination. When using genomic features for prioritizing functional variants, consideration should be made for statistical stringency as this may alter precision for functional discovery. Additionally, centrality may contribute, but we find including central regions imparts little improvement in precision compared to the combined DNase footprint and eRNA marks.

Our model for the precision offered by combined DNase footprints and eRNA markers is based on direct DNA binding of a factor, such as a TF, leading to recruitment of RNA polymerase II and other machinery to promote regulatory activity. In agreement with this model, eRNAs have been shown be an independent predictor of functional enhancers in mice in vivo [57], and to predict TF binding, where TF binding and eRNA origins colocalize [4]. eRNAs have also been found to mark functional enhancer-promoter interactions [58]. Together, these suggest eRNAs are important markers of functional CREs.

Some markers of active elements perform with low precision, such as open chromatin and DNA-factor binding (ChIP). Consistent with this, < 30% of ATAC peaks in primary CD4 + T-cells demonstrated activity using an MPRA, with MPRA signals instead correlating well with eRNA production [59]. If evidence of binding is required for functional variant prediction, then ChIP-seq signals would be expected to perform well. However, MPRA of glucocorticoid receptor binding sites showed < 20% had significant activity [60], suggesting that many bound TFs are not productive. However, artefactual binding observed from cross-linking in ChIP-seq experiments [61], and co-localization of multiple TF binding events enriching for non-related motifs, can lead to false positive binding sites [62]. Therefore, the footprint signal of direct TF binding, with a detection of eRNAs, provides better evidence for an active enhancer. Variants within these regions, particularly in DNA sequences interfacing with TF DNA binding domains (DBD), are thus more likely to be functional.

Despite the high precision for functional variants using the combination of DNase footprint and eRNA, restricting to these two markers comes with limitations, principally related to low recall, and a high number of false negatives. The low recall can be explained by limitations inherent to the DNase footprint and eRNA datasets. Firstly, the identification of DNase footprints may be related to residence time of TFs on DNA, where rapidly exchanging factors impart poor footprints [63]. Variants associated with altering binding of dynamic factors may therefore be missed, although this could be improved using modified methods such as DNase-seq with crosslinking [64]. Detection of footprints could also be improved by bias correction to improve false-negative rates [65]. Secondly, the eRNA database used in our analysis represents the lowest number of biosources, and is therefore incomplete, with highly variable numbers of features between cell-types, and predominantly under basal conditions, which would exclude context-dependent enhancer activity [22, 60]. To address this, eRNA profiling is required in cell-types, and under conditions, relevant to the GWAS trait of interest. The specific assay for divergent eRNAs may also impact precision scores for functional variants, as GRO-cap has been reported to have the highest sensitivity for functional enhancer detection [22].

Conclusion

We present our findings as a framework, FINDER—Functional SNV IdeNtification using DNase footprints and Enhancer RNA (Fig. 7). In this framework, the combination of DNase footprints and eRNA can prioritize and reduce candidate variants to a set with high precision. The merged datasets used can be analyzed for relevant cell-types associated with the complex trait or disease. This has important implications for current computational approaches and databases for predictive functional variant scores. Our findings can inform predictive scoring approaches as these do not utilize eRNA as a predictive marker. Finally, the high precision of DNase footprints and eRNA compared to other marks may be used for functionally informed fine mapping approaches [66], which could improve detection of causative non-coding variants.

Fig. 7
figure 7

The FINDER framework—Functional SNV IdeNtification using DNase footprints and Enhancer RNA. Markers of active regulatory elements as merged datasets from multiple cell- and tissue-types are used to predict functional variants. The combination of DNase footprints and eRNA provides the highest precision, which can be applied to GWAS traits or to predict functional variants in linkage disequilibrium (LD). Predicted variants can then be interrogated for cellular and molecular function, by deconvolving marker enrichment to predict relevant cell-types, or the predict altered binding affinity for DNA-binding factors

Methods

Data sources

Datasets for active chromatin marks were obtained from published and publicly available sources. DHS [35, 36] and DNase footprints [13, 27] were obtained from ENCODE. H3K27ac was obtained from GTRD [32, 33] and dbInDel [43, 67]. Human ChIP-seq and ATAC-seq datasets were obtained from GTRD [32, 33]. Enhancer RNAs were obtained from the PINTS portal [22, 44]. These data sources provide uniformly analyzed datasets for downstream analysis. Details of downloaded versions and sources are shown in Additional file 1: Table S3. For each type of active chromatin mark, to generate assay master files agnostic to multiple cell-types or factor in the case of ChIP, Bedtools merge version from the bedtools suite v2.30.0 [42] was used, requiring overlap of 1 bp or greater. The Jaccard index was computed using bedtools jaccard from the bedtools suite v2.30.0 [42].

QTL datasets were obtained from published datasets. Binding QTL (bQTL) were downloaded from AD ASTRA which determines allele-specific binding events from human ChIP-seq datasets and identifies genetic variants from ChIP-seq data [23, 68]. Reporter assay QTL (raQTL) data were downloaded from publicly available massively parallel reporter assays (MPRA) [24]. Chromatin accessibility QTL (caQTL) were downloaded from ENCODE [13, 27] and QTLbase [25, 26]. Details of downloaded versions and sources are shown in Additional file 1: Table S3. For each molecular QTL, where an rsID occurs in multiple datasets, either multiple assays (such as ChIP) or multiple cell types, these were collapsed to generate a list of unique rsIDs.

GWAS summary statistics were downloaded from the NHGRI-EBI GWAS catalog V1.0.2 [28, 29] using variants with a p value < 10−5. Where analysis presented is agnostic to traits or studies, a master list of rsID found in the GWAS catalog was generated by collapsing to a list of unique rsIDs. Details of downloaded version and source are shown in Additional file 1: Table S3.

The dataset for genetic variants was obtained from UCSC [41] using the dbSNP build 151 containing uniquely mapping common SNPs with a ≥ 1% minor allele frequency (MAF). SNPs were randomly subsetted using gshuf in command line.

Confusion matrix calculations and normalizations

Precision scores were calculated as the fraction of the number of variants (GWAS catalog or other variant sets as indicated) overlapping the benchmarked molQTL (true positives) over the total number of variants (true and false positives) intersecting the feature (or feature combinations). Precision scores were expressed as a Z-score or fold change. Clustering of precision z-scores was performed in R using heatmap cluster function with the Ward D2 method. Fold changes were expressed as the precision score of the candidate set intersected with features divided by precision score of the candidate set without intersection with features.

Genome build conversions

Data were analyzed using human genome build Hg38. Where primary data sources were aligned to other genome versions (Hg 19), data was converted to Hg38 using Liftover [69]. The genome version of the primary data sources is indicated in Additional file 1: Table S3.

Intersections of features

To analyze the overlap between genetic variants, and/or genomic features, Bedtools intersect from the bedtools suite v2.30.0 [42] was used. For intersections with rsID, these were considered as the first feature (-a) in bedtools intersect -a < bed/vcf > -b < bed > .

Random forest classification

The PeakPredict package [45] was run using the command overlap_peaks with settings “–predict_column molQTL –model RandomForestClassifier –balance –shap,” where balancing downsamples the groups to have the same size prior to splitting into test and training sets. The PeakPredict package implements scikit-learn [70] and SHAP values [71].

Gene ontology analysis

GO analysis utilized Genomic Regions Enrichment of Annotations Tool (GREAT) version 4.0 [38] with rsID converted to BED format, using the default settings.

Predicting functional base on motif

Predictions of TF binding based on motifs utilized the FABIAN-variant [72] web interface, using Hg38m, using “All” TFS, the “TFFM detailed” model, and the JASPAR2022 database.

Linkage disequilibrium analysis

To generate variants in LD with a lead variant, we included variants within 500 Kb of a lead variant with an R2 ≥ 0.7 based on European ancestry and a minor allele frequency of at least 1%. To compile variants in LD, we used TopLD API with options -thres 0.7 -pop EUR -maf 0.01 [73].

Cell-type specific analysis

To identify cell-type enrichment of variants, we used FORGE2 [48], with input as rsID of variants, using DHS as data for enrichment, without LD filtering of rsIDs.

Genome location mapping

To annotate the genomic features to genome location, we used the annotatePeaks.pI program from HOMER tools v4.1 [74]. Gene annotation was customized using a GTF file containing hg38 gene transcript sets, downloaded from UCSC (hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz)

Availability of data and materials

Individual data sets are described in “Data sources” of the “Methods” section. The merged datasets for each active regulatory marker, merged as described in “Data sources,” and QTL merged datasets can be found at GitHub [77], available under an open source (MIT) license, and at Zenodo [78]. Further details are found in Additional file 1: Table S3.

References

  1. Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, Reynolds AP, Sandstrom R, Qu H, Brody J, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337:1190–5. https://doi.org/10.1126/science.1222794.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Qi T, Wu Y, Fang H, Zhang F, Liu S, Zeng J, Yang J. Genetic control of RNA splicing and its distinct role in complex trait variation. Nat Genet. 2022;54:1355–63. https://doi.org/10.1038/s41588-022-01154-4.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Johnston AD, Simões-Pires CA, Thompson TV, Suzuki M, Greally JM. Functional genetic variants can mediate their regulatory effects through alteration of transcription factor binding. Nat Commun. 2019;10:3472. https://doi.org/10.1038/s41467-019-11412-5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Maurano MT, Haugen E, Sandstrom R, Vierstra J, Shafer A, Kaul R, Stamatoyannopoulos JA. Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo. Nat Genet. 2015;47:1393–401. https://doi.org/10.1038/ng.3432.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Jeong Y, Leskow FC, El-Jaick K, Roessler E, Muenke M, Yocum A, Dubourg C, Li X, Geng X, Oliver G, et al. Regulation of a remote Shh forebrain enhancer by the Six3 homeoprotein. Nat Genet. 2008;40:1348–53. https://doi.org/10.1038/ng.230.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Lettice LA, Williamson I, Wiltshire JH, Peluso S, Devenney PS, Hill AE, Essafi A, Hagman J, Mort R, Grimes G, DeAngelis CL, Hill RE. Opposing functions of the ETS factor family define Shh spatial expression in limb buds and underlie polydactyly. Dev Cell. 2012;22:459–67. https://doi.org/10.1016/j.devcel.2011.12.010.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Carrasco Pro S, Bulekova K, Gregor B, Labadorf A, Fuxman Bass JI. Prediction of genome-wide effects of single nucleotide variants on transcription factor binding. Sci Rep. 2020;10:17632. https://doi.org/10.1038/s41598-020-74793-4.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, et al. Variation in transcription factor binding among humans. Science. 2010;328:232–5.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Vernimmen D, Bickmore WA. The hierarchy of transcriptional activation: from enhancer to promoter. Trends Genet. 2015;31:696–708. https://doi.org/10.1016/j.tig.2015.10.004.

    Article  CAS  PubMed  Google Scholar 

  10. Grishin D, Gusev A. Allelic imbalance of chromatin accessibility in cancer identifies candidate causal risk variants and their mechanisms. Nat Genet. 2022;54:837–49. https://doi.org/10.1038/s41588-022-01075-2.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Su C, Gao L, May CL, Pippin JA, Boehm K, Lee M, Liu C, Pahl MC, Golson ML, Naji A;, et al. 3D chromatin maps of the human pancreas reveal lineage-specific regulatory architecture of T2D risk. Cell Metab. 2022;34:1394-1409.e4. https://doi.org/10.1016/j.cmet.2022.08.014.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Moyerbrailean GA, Kalita CA, Harvey CT, Wen X, Luca F, Pique-Regi R. Which genetics variants in DNase-seq footprints are more likely to alter binding? PLoS Genet. 2016;12: e1005875. https://doi.org/10.1371/journal.pgen.1005875.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Vierstra J, Lazar J, Sandstrom R, Halow J, Lee K, Bates D, Diegel M, Dunn D, Neri F, Haugen E, et al. Global reference mapping of human transcription factor footprints. Nature. 2020;583:729–36. https://doi.org/10.1038/s41586-020-2528-x.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Gazal S, Weissbrod O, Hormozdiari F, Dey KK, Nasser J, Jagadeesh KA, Weiner DJ, Shi H, Fulco CP, O’Connor LJ, et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat Genet. 2022;54:827–36. https://doi.org/10.1038/s41588-022-01087-y.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Nasser J, Bergman DT, Fulco CP, Guckelberger P, Doughty BR, Patwardhan TA, Jones TR, Nguyen TH, Ulirsch JC, Lekschas F, et al. Genome-wide enhancer maps link risk variants to disease genes. Nature. 2021;593:238–43. https://doi.org/10.1038/s41586-021-03446-x.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Schaub MA, Boyle AP, Kundaje A, Batzoglou S, Snyder M. Linking disease associations with regulatory information in the human genome. Genome Res. 2012;22:1748–59. https://doi.org/10.1101/gr.136127.111.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Chen HV, Lorenzini MH, Lavalle SN, Sajeev K, Fonesca A, Fiaux PC, Sen A, Luthra I, Ho AJ, Chen AR, et al. Deletion mapping of regulatory elements for GATA3 in T cells reveals a distal enhancer involved in allergic diseases. J Hum Genet. 2023;110(4):703–14.

    Article  CAS  Google Scholar 

  18. Gasperini M, Findlay GM, McKenna A, Milbank JH, Lee C, Zhang MD, Cusanovich DA, Shendure J. CRISPR/Cas9-mediated scanning for regulatory elements required for HPRT1 expression via thousands of large, programmed genomic deletions. Am J Hum Genet. 2017;101(2):192–205.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Wang Z, Zhao G, Li B, Fang Z, Chen Q, Wang X, Luo T, Wang Y, Zhou Q, Li K, et al. (2022) Performance comparison of computational methods for the prediction of the function and pathogenicity of non-coding variants. Genom Proteomics Bioinform. 7:S1672–0229(22)00016-X. https://doi.org/10.1016/j.gpb.2022.02.002.

  20. Tabarini N, Biagi E, Uva P, Iovino E, Pippucci T, Seri M, Cavalli A, Ceccherini I, Rusmini M, Viti F. Exploration of tools for the interpretation of human non-coding variants. Int J Mol Sci. 2022;23:12977. https://doi.org/10.3390/ijms232112977.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Hou TY, Kraus WL. Spirits in the material world: enhancer RNAs in transcriptional regulation. Trends Biochem Sci. 2021;46:138–53. https://doi.org/10.1016/j.tibs.2020.08.007.

    Article  CAS  PubMed  Google Scholar 

  22. Yao L, Liang J, Ozer A, Leung AKY, Lis JT, Yu H. A comparison of experimental assays and analytical methods for genome-wide identification of active enhancers. Nat biotechnol. 2022;40(7):1056–65.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Abramov S, Boystov A, Bykova D, Penzar DD, Yevshin I, Kolmykov SK, Fridman MV, Favorov AV, Vorontsov IE, Baulin E, Kolpakov FF, Makeev V, Kulakovskiy IV. Landscape of allele-specific transcription factor binding in the human genome. Nat Commun. 2021;12(1):2751. https://doi.org/10.1038/s41467-021-23007-0.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Van Arensbergen J, Pagie L, FitzPatrick VD, de Haas M, Baltissen MP, Comoglio F, van der Weide R, Teunissen H, Vosa U, Franke L, de Wit E, Vermeulen M, Bussemaer HHJ, van Steensel B. High-throughput identification of human SNPs affecting regulatory element activity. Nat Genet. 2019;51(7):1160–9.

    Article  PubMed  PubMed Central  Google Scholar 

  25. Zheng Z, Huang D, Wang J, Zhao K, Zhou Y, Guo Z, Zhai S, Xu H, Cui H, Yao H, Wang Z, Yi X, Zhang S, Sham PC, Li MJ. QTLbase: an integrative resource for quantitative trait loci across multiple human molecular phenotypes. NAR. 2020;48(D1):D983-991.

    Article  CAS  PubMed  Google Scholar 

  26. Zheng Z, Huang D, Wang J, Zhao K, Zhou Y, Guo Z, Zhai S, Xu H, Cui H, Yao H, Wang Z, Yi X, Zhang S, Sham PC, Li MJ. QTLbase [Internet]; 2016–2022. Available from: http://www.mulinlab.org/qtlbase.

  27. Vierstra J, Lazar J, Sandstrom R, Halow J, Lee K, Bates D, Diegel M, Dunn D, Neri F, Haugen E, et al., Digital genomic footprinting [Internet]; 2020. Available from: https://www.vierstra.org/resources/dgf.

  28. Sollis E, Mosaku A, Abid A, Buniello A, Cerezo M, Gil L, Groza T, Gunes O, Hall P, Hayhurst, et al. The NHGRI-EBI GWAS catalog: knowledgebase and deposition resource. NAR. 2023;51(D1):D977-985.

    Article  CAS  PubMed  Google Scholar 

  29. Sollis E, Mosaku A, Abid A, Buniello A, Cerezo M, Gil L, Groza T, Güneş O, Hall P, Hayhurst J, et al.., NHGRI-EBI GWAS catalog [Internet]; 2022. Available from: https://www.ebi.ac.uk/gwas/.

  30. Wang F, Bai X, Wang Y, Jiang Y, Ai B, Zhang Y, Liu Y, Xu M, Wang Q, Han X, et al. ATACdb: a comprehensive human chromatin accessibility database. Nucleic Acids Res. 2021;49(D1):D55–64.

    Article  CAS  PubMed  Google Scholar 

  31. Wang F, Bai X, Wang Y, Jiang Y, Ai B, Zhang Y, Liu Y, Xu M, Wang Q, Han X, et al., ATACdb [Internet]; 2020. Available from: https://bio.liclab.net/ATACdb/.

  32. Kolmykov S, Yevshin I, Kuulyashov M, Sharipov R, Kondrakhin Y, Makeev VJ, Kulakovskiy IV, Kel A, Kolpakov F. GTRD: an integrated view of transcription regulation. NAR. 2021;49(D1):D104–11.

    Article  CAS  PubMed  Google Scholar 

  33. Kolmykov S, Yevshin I, Kuulyashov M, Sharipov R, Kondrakhin Y, Makeev VJ, Kulakovskiy IV, Kel A, Kolpakov F. GTRD [Internet]; 2021. Available from: https://gtrd.biouml.org.

  34. Biddie SC, John S, Sabo PJ, Thurman RE, Johnson TA, Schiltz RL, Miranda TB, Sung MH, Trump S, Lightman SL, et al. Transcription factor AP1 potentiates chromatin accessibility and glucocorticoid receptor binding. Mol Cell. 2011;43(1):145–55. https://doi.org/10.1016/j.molcel.2011.06.016. (PMID: 21726817).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Meuleman W, Muratov A, Rynes E, Halow J, Lee K, Bates D, Diegel M, Dunn D, Neri F, Teodosiadis A, et al. Index and biological spectrum of human DNase I hypersensitive sites. Nature. 2020;584:244–51. https://doi.org/10.1038/s41586-020-2559-3.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Meuleman W, Muratov A, Rynes E, Halow J, Lee K, Bates D, Diegel M, Dunn D, Neri F, Teodosiadis A, et al Index and biological spectrum of human DNase I hypersensitive sites [Internet]; 2020. Available from: https://www.meuleman.org/research/dhsindex/

  37. Smedley D, Schubach M, Jacobsen JOB, Köhler S, Zemojtel T, Spielmann M, Jäger M, Hochheiser H, Washington NL, McMurry JA, et al. A whole-genome analysis framework for effective identification of pathogenic regulatory variants in Mendelian disease. Am J Hum Genet. 2016;99:595–606. https://doi.org/10.1016/j.ajhg.2016.07.005.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Tanigawa Y, Dyer ES, Bejerano G. WhichTF is functionally important in your open chromatin data? PLoS Comput Biol. 2022;18(8): e1010378. https://doi.org/10.1371/journal.pcbi.1010378.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  39. Friman, ET. PeakPredict. Zenodo. 2024. https://zenodo.org/doi/10.5281/zenodo.12706471.

  40. Sloan CA, Chan ET, Davidson JM, Malladi VS, Strattan JS, Hitz BC, Gabdank I, Narayanan AK, Ho M, Lee BT et al., ENCODE portal [Internet]; 2016. Available from: https://www.encodeproject.org/.

  41. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. NAR. 2001;21(1):308–11.

    Article  Google Scholar 

  42. Quinlan AR and Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26(6):841–2

  43. Huang M, Wangg Y, Yangg M, Yan J, Yangg H, Zhuangg W, Xu Y, Koeffler HP, Lin DC, Chen X. dbInDel [Internet]; 202. Available from: http://enhancer-indel.cam-su.org.

  44. Yao L, Liang J, Ozer A, Leung AKY, Lis JT, Yu H. PINTS web portal [Internet]; 2022. Available from: https://pints.yulab.org.

  45. Friman, ET. PeakPredict. Github .2024. https://github.com/efriman/PeakPredict.

  46. Rai V, Quang DX, Erdos MR, Cusanovich DA, Daza RM, Narisu N, Zou LS, Didion JP, Guan Y, Shendure J, et al. Single-cell ATAC-Seq in human pancreatic islets and deep learning upscaling of rare cells reveals cell-specific type 2 diabetes regulatory signatures. Mol Metab. 2020;32:109–21. https://doi.org/10.1016/j.molmet.2019.12.006.

    Article  CAS  PubMed  Google Scholar 

  47. Torres JM, Abdalla M, Payne A, Fernandez-Tajes J, Thurner M, Nylander V, Gloyn AL, Mahajan A, McCarthy MI. A multi-omic integrative scheme characterizes tissues of action at loci associated with type 2 diabetes. Am J Hum Genet. 2020;107:1011–28. https://doi.org/10.1016/j.ajhg.2020.10.009.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Breeze CE, Haugen E, Reynolds A, Teschendorff A, van Dongen J, Lan Q, Rothman N, Bourque G, Dunham I, Beck S, et al. Integrative analysis of 3604 GWAS reveals multiple novel cell type-specific regulatory associations. Genome Biol. 2022;23:13. https://doi.org/10.1186/s13059-021-02560-3.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  49. Garrett-Sinha LA. Review of Ets1 structure, function, and roles in immunity. Cell Mol Life Sci. 2013;70:3375–90. https://doi.org/10.1007/s00018-012-1243-7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  50. Schwartz AM, Demin DE, Vorontsov IE, Kasyanov AS, Putlyaeva LV, Tatosyan KA, Kulakovskiy IV, Kuprash DV. Multiple single nucleotide polymorphisms in the first intron of the IL2RA gene affect transcription factor binding and enhancer activity. Gene. 2017;603:50–6.

    Article  Google Scholar 

  51. Johansson Å, Rask-Andersen M, Karlsson T, Ek WE. (2019) Genome-wide association analysis of 350 000 Caucasians from the UK Biobank identifies novel loci for asthma, hay fever and eczema Hum Mol Genet. 28:4022–4041. https://doi.org/10.1093/hmg/ddz175.

  52. Portelli MA, Dijk FN, Ketelaar ME, Shrine N, Hankinson J, Bhaker S, Grotenboer NS, Obeidat M, Henry AP, Billington CK, et al. Phenotypic and functional translation of IL1RL1 locus polymorphisms in lung tissue and asthmatic airway epithelium. JCI Insight. 2020;5: e132446. https://doi.org/10.1172/jci.insight.132446.

    Article  PubMed  PubMed Central  Google Scholar 

  53. Daya M, Rafaels N, Burnetti TM, Chhavan S, Levin AM, Shetty A, Gignoux CR, Boorgula MP, Wojcik G, et al. Association study in African-admixed populations across the Americas recapitulates asthma risk loci in non-African populations. Nat Commun. 2019;10(1):880. https://doi.org/10.1038/s41467-019-08469-7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  54. Trajanoska K, Bherer C, Taliun D, Zhouu S, Richards JB, Mooser V. From target discovery to clinical drug development with human genetics. Nature. 2023;620(7975):737–45.

    Article  CAS  PubMed  Google Scholar 

  55. Dong S, Zhao N, Spragins E, Kagda MS, Li M, Assis P, Jolanki O, Luo Y, Cherry JK, Boyle AP et al. Annotating and prioritizing human non-coding variants with RegulomeDB v.2 Nat Genet. 20223;55(5):724–726

  56. Downes DJ, Cross AR, Hua P, Roberts N, Schwessinger R, Cutler AJ, Munis AM, Brown J, Mielczarek O, de Andrea CE, et al. Identification of LZTFL1 as a candidate effector gene at a COVID-19 risk locus. Nat Genet. 2021;53(11):1606–15.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  57. Wu H, Nord AS, Akiyama JA, Shoukry M, Afzal V, Rubin EM, et al. Tissue-specific RNA expression marks distant-acting developmental enhancers. PLoS Genet. 2014;10(9):e1004610. https://doi.org/10.1371/journal.pgen.1004610.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  58. Lee SA, Kristjánsdóttir K, Kwak H. eRNA co-expression network uncovers TF dependency and convergent cooperativity. Sci Rep. 2023;13(1):19085.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  59. Stefan K, Barksi A. Cis-regulatory atlas of primary human CD4+ T cells. BMC Genomics. 2023;24(1):253. https://doi.org/10.1186/s12864-023-09288-3.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  60. Vockley CM, D’Ippolito AM, McDowell IC, Majoros WH, Safi A, Song L, Crawford GE, Reddy TE. Direct GR binding sites potentiate clusters of TF binding across the human genome. Cell. 2016;166(5):1269–81.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  61. Mercer TR, Edwards SL, Clark MB, Nephh SJ, Wang H, Stergachis AB, John S, Sandstrom R, Li G, Sandhu KS, et al. DNase I-hypersensitive exons colocalize with promoters and distal regulatory elements. Nat Genet. 2013;45(8):852–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  62. Partridge EC, Chhetri SB, Prokop JW, Ramaker RC, Jansen CS, Goh ST, Machiewicz M, Newberry KM, Brandsmeier LA, Meadows SK, et al. Occupancy maps of 208 chromatin-associated proteins in one human cell type. Nature. 2020;583(7818):720–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  63. Sung MH, Guertin MJ, Baek S, Hager GL. DNase footprint signatures are dictated by factor dynamics and DNA sequence. Mol Cell. 2014;56(2):275–85.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  64. Oh KS, Ha J, Baek S, Sung MH. XL-DNase-seq: improved footprinting of dynamic transcription factors. Epigenetics Chromatin. 2019;12(1):30. https://doi.org/10.1186/s13072-019-0277-6.

    Article  PubMed  PubMed Central  Google Scholar 

  65. Calviello AK, Hirsekorn A, Wurmus R, Yusuf D, Ohler U. Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modelling. Genome Biol. 2019;20(1):42.

    Article  Google Scholar 

  66. Kumasaka N, Knights AJ, Gaffney DJ. Fine-mapping cellular QTLs with RASQUAL and ATAC-seq. Nat Genet. 2016;48(2):206–13.

    Article  CAS  PubMed  Google Scholar 

  67. Huang M, Wangg Y, Yangg M, Yan J, Yangg H, Zhuangg W, Xu Y, Koeffler HP, Lin DC, Chen X. dbInDel: a database of enhancer-associated insertion and deletion variants by analysis of H3K27ac ChIP-Seq. Bioinformatics. 2020;36(5):1649–51.

    Article  CAS  PubMed  Google Scholar 

  68. Abramov S, Boystov A, Bykova D, Penzar DD, Yevshin I, Kolmykov SK, Fridman MV, Favorov AV, Vorontsov IE, Baulin E, Kolpakov FF, Makeev V, Kulakovskiy IV. ADASTRA [Internet]; 2021. Available from: https://adastra.autosome.org/bill-cipher/downloads?releaseName=Zanthar.

  69. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F et al. The UCSC Genome Browser Database: update 2006. NAR 2006;34(D):D590–8

  70. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J. Scikit-learn: machine learning in Python. J Machine Learn Res. 2011;12:2825–30.

    Google Scholar 

  71. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inform Process Syst. 2017;30. https://doi.org/10.48550/arXiv.1705.07874.

  72. Steinhaus R, Robinson PN, Seelow D. FABIAN-variant: predicting the effects of DNA variants on transcription factor binding. NAR. 2022;50(W1):W322-329.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  73. Huang L, Rosen JD, Sun Q, Chen J, Wheeler MM, Zhou Y, Min Y, Kooperberg C, Conomos MP, Stilp AM, et al. TOP-LD: a tool to explore linkage disequilibrium with TOPMed whole genome sequence data. Am J Human Genet. 2022;109(6):1175–81. https://doi.org/10.1016/j.ajhg.2022.04.006.

    Article  CAS  Google Scholar 

  74. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38(4):576–89.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  75. Review Commons report 1. Early evidence base. 2024. https://doi.org/10.15252/rc.2024235645.

  76. Review Commons report 2. Early evidence base. 2024. https://doi.org/10.15252/rc.2024765375.

  77. Biddie SC. FINDER, GitHub. 2024. https://github.com/sbiddie/FINDER/tree/v1.0.

  78. Biddie SC. FINDER, Zenodo.2024. https://zenodo.org/doi/10.5281/zenodo.12795448.

Download references

Review history

This article was first peer-reviewed at Review Commons. The reviewer reports are available online [75, 76]. The rest of the review history containing the authors’ responses and additional reviewer comments are available as Additional file 2.

Peer review information

Wenjing She was the primary editor of this article at Genome Biology and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Funding

SCB is funded by Chief Science Office and NHS Education Scotland (PCL/20/02). ETF was supported by a fellowship from the Swiss National Science Foundation (P500PB_206805). GW is supported by an MRC PhD studentship and WAB are supported by an MRC University Unit grant (MC_UU_000035/7).

Author information

Authors and Affiliations

Authors

Contributions

SCB and WAB conceived the study. SCB, GW, EFH, and ETF performed data analyses. SCB wrote the manuscript. All authors were involved in the preparation and editing of the manuscript.

Corresponding authors

Correspondence to Simon C. Biddie or Wendy A. Bickmore.

Ethics declarations

Ethics approval and consent to participate

All datasets analyzed have been previously published and undergone ethical approval where appropriate. No ethical approval was required for this study.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Biddie, S.C., Weykopf, G., Hird, E.F. et al. DNA-binding factor footprints and enhancer RNAs identify functional non-coding genetic variants. Genome Biol 25, 208 (2024). https://doi.org/10.1186/s13059-024-03352-1

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13059-024-03352-1

Keywords