Chromosome contacts in activated T cells identify autoimmune disease-candidate genes

Background Autoimmune disease-associated variants are preferentially found in regulatory regions in immune cells, particularly CD4+ T cells. Linking such regulatory regions to gene promoters in disease relevant cell contexts facilitates identification of candidate disease genes. Results Within four hours, activation of CD4+ T cells invokes changes in histone modifications and enhancer RNA transcription that correspond to altered expression of the interacting genes identified by promoter capture Hi-C (PCHi-C). By integrating PCHi-C data with genetic associations for five autoimmune diseases we prioritised 245 candidate genes with a median distance from peak signal to prioritised gene of 153 kb. Just under half (108/245) prioritised genes related to activation-sensitive interactions. This included IL2RA, where allele-specific expression analyses were consistent with its interaction-mediated regulation, illustrating the utility of the approach. Conclusions Our systematic experimental framework offers an alternative approach to candidate causal gene identification for variants with cell state-specific functional effects, with achievable sample sizes.


Background
Genome-wide association studies (GWAS) in the last decade have associated 324 distinct genomic regions to at least one and often several autoimmune diseases (http://www.immunobase.org). The majority of associated variants lie outside genes [1] and presumably tag regulatory variants acting on nearby or more distant genes [2,3]. Progress from GWAS discovery to biological interpretation has been hampered by lack of systematic methods to define the gene(s) regulated by a given variant. The use of Hi-C [4] and capture Hi-C to link GWAS identified variants to their target genes for breast cancer [5] and autoimmune diseases [6] using cell lines, has highlighted the potential for mapping long range interactions in advancing our understanding of disease association. The observed cell specificity of these interactions indicates a need to study primary disease-relevant human cells, and investigate the extent to which cell state may affect inference.
Integration of GWAS signals with cell-specific chromatin marks has highlighted the role of regulatory variation in immune cells [7], and in particular CD4 + T cells, in autoimmune diseases [8]. Concordantly, differences in DNA methylation of immune-related genes have been observed in CD4 + T cells from autoimmune disease patients compared to healthy controls [9,10]. CD4 + T cells are at the centre of the adaptive immune system and exquisite control of activation is required to guide CD4 + T cell fate through selection, expansion and differentiation into one of a number of specialised subsets. Additionally, the prominence of variants in physical proximity to genes associated with T cell regulation in autoimmune disease GWAS and the association of human leukocyte antigen haplotypes have suggested that control of T cell activation is a key etiological pathway in development of many autoimmune diseases [11].
We explored the effect of activation on CD4 + T cell gene expression, chromatin states and chromosome conformation. PCHi-C was used to map promoter interacting regions (PIRs), and to relate activationinduced changes in gene expression to changes in chromosome conformation and transcription of PCHi-C linked enhancer RNAs (eRNAs). We also fine mapped the most probable causal variants for five autoimmune diseases, autoimmune thyroid disease (ATD), coeliac disease (CEL), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE) and type 1 diabetes (T1D). We integrated these sources of information to derive a systematic prioritisation of candidate autoimmune disease genes.

A time-course expression profile of early CD4 + T cell activation
We profiled gene expression in CD4 + T cells from 20 healthy individuals across a 21 hour activation timecourse, and identified eight distinct gene modules by clustering these profiles (Fig. 1, Additional File 1: Table S1). This experimental approach focused on much earlier events than previous large time-course studies (eg 6 hours -8 days [12]) and highlights the earliest changes that are either not seen or are returning towards baseline by 6 hours (Additional File 2: Fig. S1). Gene set enrichment analysis using MSigDB Hallmark gene sets [13] demonstrated that these modules captured temporally distinct aspects of CD4 + T cell activation. For example, negative regulators of TGF-beta signalling were rapidly upregulated, and returned to baseline by 4 hours. Interferon responses, inflammatory responses and IL-2 and STAT5 signalling pathways showed a more sustained upregulation out beyond 6 hours, while fatty acid metabolism was initially downregulated in favour of oxidative phosphorylation.

PCHi-C captures dynamic enhancer-promoter interactions
We examined activated and non-activated CD4 + T cells purified from healthy individuals in more detail at the four-hour time point, at which the average fold change of genes related to cytokine signalling and inflammatory response was maximal, using total RNA sequencing, histone mark chromatin immunoprecipitation sequencing (ChIP-seq) and PCHi-C. Of 8,856 genes identified as expressed (see Methods) in either condition (non-activated or activated), 25% were up-or down-regulated (1,235 and 952 genes respectively, FDR<0.01, Additional File 3: Table S2). We used PCHi-C to characterise promoter interactions in activated and non-activated CD4 + T cells. Our capture design baited 22,076 HindIII fragments (median length 4 kb) which contained the promoters of 29,131 annotated genes, 18,202 of which are protein coding (Additional File 4 : Table S3). We detected 283,657 unique PCHi-C interactions with the CHiCAGO pipeline [14], with 55% found in both activation states, and 22% and 23% found only in non-activated and only in activated CD4 + T cells, respectively (Additional File 5 : Table   S4). 11,817 baited promoter fragments were involved in at least one interaction, with a median distance between interacting fragments of 324 kb. Each interacting promoter fragment connected to a median of eight promoter interacting regions (PIRs), while each interacting PIR was connected to a median of one promoter fragment (Additional File 2: Fig. S2).
We compared our interaction calls to an earlier ChIA-PET dataset from non-activated CD4 + T cells [15] and found we replicated over 50% of the longer range interactions (100 kb or greater), with replication rates over ten-fold greater for interactions found in non-activated CD4 + T cells compared to interactions found only in erythroblasts or megakaryocytes (Additional File 2: Fig. S3). We also compared histone modification profiles in interacting fragments in CD4 + T cells to interacting fragments found in erythroblasts or megakaryocytes. Both promoter fragments and, to a lesser extent, PIRs were enriched for histone modifications associated with transcriptionally active genes and regulatory elements (H3K27ac, H3K4me1, H3K4me3; Additional File 2: Fig. S4), and changes in H3K27ac modifications at both promoter fragments and PIRs correlated with changes in gene expression upon activation. PIRs, but not promoter fragments, showed significant overlap with regions previously annotated as enhancers [16].
We found that absolute levels of gene expression correlated with the number of PIRs (Additional File 2: Fig. S5a, rho=0.81), consistent with recent observations [15]. We defined a subset of PCHi-C interactions that were specifically gained or lost upon activation (2,334 and 1,866 respectively, FDR<0.01) and found that the direction of change (gain or loss) at these differential interactions agreed with the direction of differential expression (up-or down-regulated) at the module level (Fig. 2). We further found that dynamic changes in gene expression upon activation correlated with changing numbers of PIRs. Notably, the pattern was asymmetrical, with a gained interaction associating with approximately two fold the change associated with a lost interaction (Fig. 3a). Given the >6 hour median half life of mRNAs expressed in T cells [17] (Additional File 2: Fig. S5b), it is possible that the relatively smaller changes associated with lost interactions are due to the persistence of downregulated transcripts at the early stages of T cell activation.
As we sequenced total RNA without a poly(A) selection step, we were able to detect regulatory region RNAs (regRNAs), which are generally non-polyadenylated and serve as a mark for promoter and enhancer activity [18]. We defined 6,147 "expressed" regRNAs (see Methods) that mapped within regulatory regions defined by a 15 state ChromHMM [19] model (Additional File 2: Fig. S6) and validated them by comparison to an existing cap analysis of gene expression (CAGE) dataset [20] which has been successfully used to catalog active enhancers. [21] We found 2,888/3,897 (74%) regRNAs expressed in non-activated cells overlap CAGE defined enhancers. This suggests that the combination of chromatin state annotation and total RNA-seq data presents an alternative approach to capture active enhancers.
Almost half (48%) of expressed regRNAs showed differential expression after activation (2,254/681 up/down regulated; FDR<0.01). To determine whether activity at these regRNAs could be related to that at PCHi-C linked genes, we focused attention on a subset of 640 intergenic regRNAs, which correspond to a definition of eRNAs [22]. Of these, 404 (63%) overlapped PIRs detected in CD4 + T cells and we found significant agreement in the direction of fold changes at eRNAs and their PCHi-C linked protein coding genes in activated CD4 + T cells (p<0.0001, Fig. 3b). We also observed a synergy between regRNA expression and the estimated effect of a PIR on expression with a gain or loss of a PIR overlapping a differentially regulated regRNA having the largest estimated effect on gene expression (Fig. 3c), supporting a sequential model of gene activation [23]. While regRNA function is still unknown [22], our results demonstrate the detection, by PCHi-C, of condition-specific connectivity between promoters and enhancers involved coordinating gene regulation.

PCHi-C-facilitated mapping of candidate disease causal genes
We defined an experimental framework to integrate PCHi-C interactions with GWAS data to map candidate disease causal genes for autoimmune diseases (Fig. 4). First, to confirm that PCHi-C interactions inform interpretation of autoimmune disease GWAS, we tested whether PIRs were enriched for autoimmune disease GWAS signals by testing for different distributions of GWAS p values in PIRs of activated or non-activated CD4 + T cells compared to non-lymphoid cells (megakaryocytes and erythroblasts) and then in PIRs of activated compared to non-activated CD4 + T cells. To perform the test, we used blockshifter, which accounts for correlation between (1) neighbouring variants in the GWAS data and (2) neighbouring HindIII fragments in the interacting data by rotating one dataset with respect to the other, as previously proposed [24]. This method appropriately controls type 1 error rates, in contrast to methods based on counting associated SNP/PIRs which ignore correlation, such as a Fisher's exact test (Additional File 2: Fig. S7). We found autoimmune GWAS signals were enriched in CD4 + T cell PIRs compared to non-autoimmune GWAS signals (Wilcoxon p = 2.5x10 -7 ) and preferentially so in activated versus non-activated cells (Wilcoxon p = 4.8x10 -5 ; Fig. 4).
Next, we fine-mapped causal variants for five autoimmune diseases using genetic data from a dense targeted genotype array, the ImmunoChip (ATD, CEL, RA, T1D), and summary data from GWAS data imputed to 1000 Genomes Project (RA, SLE; Additional File 6 : Table S5). For the ImmunoChip datasets, with full genotype data, we also imputed to 1000 Genomes and used a Bayesian fine mapping approach [25] which avoids the need for stepwise regression or assumptions of single causal variants and which provides a measure of posterior belief that any given variant is causal by aggregating posterior support over models containing that variant. Variant-level results are given in Additional Files 7-8 : Tables S6a-S6b, and show that of 73 regions with genetic association signals to at least one disease (minimum p < 5x10 -8 ; 106 disease associations), nine regions have strong evidence that they contain more than one causal variant (posterior probability > 0.5), among them the well studied region on chromosome 10 containing the candidate gene IL2RA [25]. For the GWAS summary data, we make the simplifying assumption that there exists a single causal variant in any LD-defined genetic region and again generate posterior probabilities that each variant is causal [26]. To integrate these variant level data with PCHi-C interactions and prioritize protein coding genes as candidate causal genes for each autoimmune disease, we calculated gene-level posterior support by summing posterior probabilities over all models containing variants in PIRs for a given gene promoter, within the promoter fragment or within its immediate neighbour fragments. Neighbouring fragments are included because of limitations in the ability of PCHi-C to detect very proximal interactions (within a region consisting of the promoter baited fragment and one HindIII fragment either side). The majority of gene scores were close to 0 (99% of genes have a score <0.05) and we chose to use a threshold of 0.5 to call genes prioritised for further investigation. Having both ImmunoChip and summary GWAS data for RA allowed us to compare the two methods. Overlap was encouraging: of 24 genes prioritised for RA from ImmunoChip, 20 had a gene score > 0.5 in the GWAS prioritisation, a further three had gene scores > 0.2. The remaining gene, MDN1, corresponded to a region which has a stronger association signal in the RA-ImmunoChip than RA-GWAS dataset, which may reflect the greater power of direct genotyping versus imputation, given that the RA-ImmunoChip signal is mirrored in ATD and T1D (Additional File 2: Fig. S8). We prioritised a total of 245 unique protein coding genes, 108 of which related to activation sensitive interactions (Additional Files 9-10 : Tables S7a-7b, Fig. 4). Of 118 prioritised genes which could be related through interactions to a known susceptibility region, 63 (48%) lay outside that disease susceptibility region. The median distance from peak signal to prioritised gene was 153 kb. Note that prioritisation can be one (variant)-to-many (genes) because a single PIR can interact with more than one promoter, and promoter fragments can contain more than one gene promoter. Note also that the score reflects both PCHi-C interactions and the strength and shape of association signals (Additional File 2: Fig. S9), therefore a subset of prioritised genes relate to an aggregation over sub-genomewide significant GWAS signals. This is therefore a "long" list of prioritised genes which requires further filtering ( Table 1). One hundred and seventy nine (of 245) prioritised genes were expressed in at least one activation state; we highlight specifically the subset of 118 expressed genes which can be related to a genome-wide significant GWAS signal through proximity of a genome-wide significant SNP (p<5x10 -8 ) to a PIR. Of these, 82 were differentially expressed, 48 related to activationsensitive interactions and 63 showed overlap of GWAS fine-mapped variants with an expressed eRNA (Additional File 9 : Table S7a). Note that group 2 is a subset of group 1, group 3 is a subset of group 2, and groups 4, 5 and 6 are all subsets of group 3 but not necessarily of each other.
Taken together, our results reflect the complexity underlying gene regulation, and the context-driven effects that common autoimmune disease-associated variants may have on candidate genes. Our findings are consistent with, and extend, previous observations [7,8] and we highlight six examples which exemplify both activation-specific and activation-invariant interactions.

Exemplar regions
Here we highlight specific examples of prioritised genes with plausible autoimmune disease candidacy which illustrate three characteristics we found frequently, namely (1) the identification of candidate genes some distance from association signals, and (2) the tendency for multiple gene promoters to be identified as interacting with the same sets of disease associated variants and (3) genes prioritised in only one state of activation.
As an example of the first, CEL has been associated with a region on chromosome 1q31.2, for which RGS1 has been named as a causal candidate due to proximity of associated variants to its promoter [27].
Sub-genome-wide significant signals for T1D (min. p=1.5x10 -6 ) across the same SNPs which are associated with CEL have been interpreted as a colocalising T1D signal in the region [28]. RGS1 has recently been shown to have a role in the function of T follicular helper cells in mice [29], the frequencies of which and their associated IL-21 production have been shown to be increased in T1D patients [30].
However, our analysis also prioritises, in activated T cells, the strong functional candidate genes TROVE2 and UCHL5, over half a megabase distant and with three intervening genes not prioritised (Fig. 5).
TROVE2 is significantly upregulated upon activation (FDR=0.005) and encodes Ro60, an RNA binding protein that indirectly regulates type-I IFN-responses by controlling endogenous Alu RNA levels [33]. A global anti-inflammatory effect for TROVE2 expression would fit with its effects on gut (CEL) and pancreatic islets (T1D).
A similar situation is seen in the chromosome 1q32.1 region associated with T1D in which IL10 has been named as a causal candidate gene [34]. Together with IL10, prioritised through proximity of credible SNPs to the IL10 promoter, we prioritised other IL10 gene family members IL19, IL20 and IL24 as well as two members of a conserved three-gene immunoglobulin-receptor cluster (FCMR and PIGR, Additional File 2: Fig. S10). While IL19, IL20 and PIGR were not expressed in CD4 + T cells, FCMR was down-and IL24 and IL10 were up-regulated following CD4 + T cell activation. IL-10 is recognised as an important antiinflammatory cytokine in health and disease [35] and candidate genes FCMR and IL24 are components of a recently identified proinflammatory module in Th17 cells [36].
This, region also exemplifies characteristic 2: candidate causal variants lay in HindIII fragments that interacted with multiple genes. Parallel results have demonstrated co-regulation of multiple PCHi-C interacting genes by a single variant [37], suggesting that disease related variants may act on multiple genes simultaneously, consistent with the finding that regulatory elements can interact with multiple promoters [38][39][40]. This region also shows that clusters of multiple adjacent PIRs can be detected for the same promoter. It remains to be further validated whether all PIRs detected within such clusters correspond to 'causal' interactions or whether some such PIRs are 'bystanders' of strong interaction signals occurring in their vicinity. The use of PCHi-C nonetheless adds considerable resolution compared to simply considering topologically associating domains (TADs), which have a median length of 415 kb in naive CD4 + T cells [37] compared to a median of 37.5 kb total PIR length per gene in non-activated CD4 + T cells (Additional File 2: Fig. S11).
Multiple neighbouring genes were also prioritised on chromosome 16q24.1: EMC8, COX4I1 and IRF8, the last only in activated T cells, for two diseases: RA and SLE (Additional File 2: Fig. S12). Candidate causal variants for SLE and RA fine-mapped to distinct PIRs, yet all these PIRs interact with the same gene promoters, suggesting that interactions, possibly specific to different CD4 + T cell subsets, may allow us to unite discordant GWAS signals for related diseases [6,41,42]. EMC8 and COX4I1 RNA expression was relatively unchanged by activation, whereas IRF8 expression was upregulated 97-fold, coincident with the induction of 16 intergenic IRF8 PIRs, four of which overlap autoimmune disease fine-mapped variants. Although the dominant effect of IRF8 is to control the maturation and function of the mononuclear phagocytic system [43], a T cell-intrinsic function regulating CD4 + Th17 differentiation has been proposed [44]. Our data further link the control of Th17 responses to susceptibility to autoimmune disease including RA and SLE [45].
Another notable example AHR, was one of the 38 genes we prioritised in only one state of activation (characteristic 3, Fig. 4): for rheumatoid arthritis (RA), AHR was prioritised specifically in activated T cells rather than non-activated T cells (Additional File 2: Fig. S13). AHR is a high affinity receptor for toxins in cigarette smoke that has been linked to RA previously through differential expression in synovial fluid of patients, though not through GWAS [48]. Our analysis prioritises AHR for RA due to a sub-genome-wide significant signal (rs71540792, p=2.9x10 -7 ) and invites attempts to validate the genetic association in additional RA patients.

Interaction-mediated regulation of IL2RA expression
Given our prior interest in the potential for autoimmune-disease associated genetic variants to regulate IL2RA expression [42], we were interested to see PCHi-C interactions between some of these variants and the IL2RA promoter, and attempted to confirm the predicted functional effects on IL2RA expression experimentally. IL2RA encodes CD25, a component of the key trimeric cytokine receptor that is essential for high-affinity binding of IL-2, regulatory T cell survival and T effector cell differentiation and function [49]. Multiple variants in and near IL2RA have been associated with a number of autoimmune diseases [34,[50][51][52]. We have previously fine-mapped genetic causal variants for T1D and multiple sclerosis (MS) in the IL2RA region [25], identifying five groups of SNPs in intron 1 and upstream of IL2RA, each of which is likely to contain a single disease causal variant. Out of the group of eight SNPs previously denoted "A" [25], three (rs12722508, rs7909519 and rs61839660) are located in an area of active chromatin in intron 1, within a PIR that interacts with the IL2RA promoter in both activated and non-activated CD4 + T cells (Fig. 6a). These three SNPs are also in LD with rs12722495 (r 2 >0.86) that has previously been associated with differential surface expression of CD25 on memory T cells [42] and differential responses to IL-2 in activated Tregs and memory T cells [53]. We measured the relative expression of IL2RA mRNA in five individuals heterozygous across all group "A" SNPs and homozygous across most other associated SNP groups, in a four-hour activation time-course of CD4 + T cells. Allelic imbalance was observed consistently for two reporter SNPs in intron 1 and in the 3' UTR in non-activated CD4 + T cells in each individual ( Fig. 6b; Additional File 2: Fig. S14a), validating a functional effect of the PCHi-C-derived interaction between this PIR and the IL2RA promoter in non-activated CD4 + T cells .
While the allelic imbalance was maintained in non-activated cells cultured for 2-4 hours, the imbalance was lost in cells activated under our in vitro conditions. Since increased CD25 expression with rare alleles at group "A" SNPs has previously been observed on memory CD4 + T cells but not the naive or Treg subsets that are also present in the total CD4 + T cell population [42], we purified memory cells from 8 and across the group A SNPs in intron 1, could account for the loss of allelic imbalance. These results emphasise the importance of steady-state CD25 levels on CD4 + T cells for the disease association mediated by the group A SNPs, levels which will make the different subsets of CD4 + T cells more or less sensitive to the differentiation and activation events caused by IL-2 exposure in vivo [54].

Discussion
Our results illustrate the changes in chromosome conformation in a single cell type in response to a single activation condition, in addition to providing support for the candidacy of certain genes and sequences in GWAS regions as causal for disease. That the PCHi-C technique can indeed link enhancers to their target genes is supported by our evidence that the direction of fold changes at eRNAs is connected to those at their PCHi-C linked protein coding genes. Recent attempts to link GWAS signals to variation in gene expression in primary human cells have sometimes found only limited overlap [55][56][57]. One explanation may be that these experiments miss effects in specific cell subsets or states, especially given the transcriptional diversity between the many subsets of memory CD4 + T cells [58]. We highlight the complex nature of disease association at the IL2RA region where additional PIRs for IL2RA gained upon activation overlap other fine-mapped disease-causal variants (Fig. 6a), suggesting that other allelicallyimbalanced states may exist in activated cells, which may also correspond to altered disease risk. For example, the PIR containing rs61839660, a group A SNP, also contains an activation eQTL for IL2RA expression in CD4 + T effectors [59] marked by rs12251836, which is unlinked to the group A variants and was not associated with T1D [59]. Furthermore, rs61839660 itself has recently been reported as a QTL for methylation of the IL2RA promoter as well as an eQTL for IL2RA expression in whole blood [57,60]. The differences between CD25 expression in different T cell subsets [61,62], and the rapid activation-induced changes in gene and regulatory expression, chromatin marks and chromosome interactions we observe, imply that a large diversity of cell types and states will need to be assayed to fully understand the identity and effects of autoimmune disease causal variants.
Our approach, like others such as eQTL analyses and integration of GWAS variants with chromatin state information, offers a view of disease through the prism of purified cell subsets in specific states of activation. However, a more complete range of cell types and activation states will be needed for the comprehensive understanding of complex diseases for which multiple cell types are aetiologically involved. It will be challenging to assay this greater diversity of cell types and states in the large numbers of individuals needed for traditional eQTL studies, particularly for cell-type or condition-specific eQTLs that have been shown to generally have weaker effects [63,64]. Allele-specific expression (ASE) is a more powerful design to quantify the effects of genetic variation on gene expression with modest sample sizes [65] and the targeted ASE approach that we adopt enables testing of individual variants or haplotypes at which donors are selected to be heterozygous, while controlling for other potentially related variants at which donors are selected to be homozygous.

Conclusions
Here we have presented an approach for connecting disease-associated variants derived from GWAS with putative target genes based on promoter interactome maps obtained with PCHi-C. this greater theneeded By using statistical fine mapping of GWAS data, integrated with PCHi-C, to highlight both likely disease causal variants and their potential target genes, we enable the design of targeted ASE analyses for functional confirmation of individual effects. This systematic experimental framework offers an alternative approach to candidate causal gene identification for variants with cell state-specific functional effects, with achievable sample sizes.

CD4 + T cell purification and activation, preparation for genomics assays
Blood samples were obtained from healthy donors selected from the Cambridge BioResource. Donors were excluded if they were diagnosed with autoimmune disease or cancer, were receiving immunosuppressants, cytotoxic agents or intravenous immunoglobulin or had been vaccinated or received antibiotics in the 2-4 weeks preceding the blood donation. CD4 + T cells were isolated from whole blood For PCHi-C [37], DNA was digested overnight with HindIII, end labeled with biotin-14-dATP and ligated in preserved nuclei. De-crosslinked DNA was sheared to an average size of 400 bp, end-repaired and adenine-tailed. Following size selection (250-550 bp fragments), biotinylated ligation fragments were immobilized, ligated to paired-end adaptors and libraries amplified (7-8 PCR amplification rounds).
Biotinylated 120-mer RNA baits targeting both ends of HindIII restriction fragments that overlap Ensembl-annotated promoters of protein-coding, noncoding, antisense, snRNA, miRNA and snoRNA transcripts were used to capture targets [67]. After enrichment, the library was further amplified (4 PCR cycles) and sequenced on the Illumina HiSeq 2500 platform. Each PCHi-C library was sequenced over 3 lanes generating 50 bp paired-end reads.

PCHi-C interaction calls
Raw sequencing reads were processed using the HiCUP pipeline [68] and interaction confidence scores were computed using the CHiCAGO pipeline [14] as previously described [37]. We considered the set of interactions with high confidence scores (> 5) in this paper.
Raw PCHi-C read counts from 3 replicates and 2 conditions were transformed into a matrix, and a trimmed mean of M-values normalization was applied to account for library size differences.
Subsequently, a voom normalization was applied to log-transformed counts in order to estimate precision weights per contact, and differential interaction estimates were obtained after fitting a linear model on a paired design, using the limma Bioconductor R package [69].

Microarray measurement of gene expression
We recruited 20 healthy volunteers from the Cambridge BioResource. Total CD4 + T cells were isolated from whole blood within 2 hours of venepuncture by RosetteSep (StemCell technologies). To assess the transcriptional variation in response to TCR stimulation, 10 6 CD4 + T cells were cultured in U-bottom 96well plates in the presence or absence of human T activator CD3/CD28 beads at a ratio of 1 bead : 3 cells.
Cells were harvested at 2 , 4, 6 or 21 hours post-stimulation, or after 0, 6 or 21 hours in the absence of stimulation. Three samples from the 6 hour unstimulated time point were omitted from the study due to insufficient cell numbers, and a further four samples were dropped after quality control, resulting in a total of 133 samples that were included in the final analysis. RNA was isolated using the RNAeasy kit (Qiagen) according to the manufacturer's instructions.
cDNA libraries were synthesized from 200 ng total RNA using a whole-transcript expression kit (Ambion) according to the manufacturer's instructions and hybridized to Human Gene 1.1 ST arrays (Affymetrix). Microarray data were normalized using a variance stabilizing transformation [70] and differential expression was analysed in a paired design using limma [69]. Genes were clustered into modules using WGCNA [71]. Clustering code is available at https://github.com/chr1swallace/cd4pchic/blob/master/make_modules.R.

ChIP sequencing and regulatory annotation
ChIP sequencing reads for all histone modification assays and control experiments were mapped to the reference genome using BWA-MEM [72], a Burrows-Wheeler genome aligner. Samtools [73] was employed to filter secondary and low-quality alignments (we retained all read pair alignments with PHRED score > 40 that matched all bits in SAM octal flag 3, and did not match any bits in SAM octal flag 3840). The remaining alignments were sorted, indexed and a whole-genome pileup was produced for each histone modification, sample and condition triple.

RNA sequencing
Total RNA was isolated using the RNeasy kit (Qiagen) and the concentrations and integrity were quantified using Bioanalyzer (Agilent); all samples reached RINs > 9. 8 After alignment, we employed Samtools [73] to discard reads with an unmapped pair, secondary alignments and low-quality alignments. The resulting read dataset, with an average of 33 million pairedend reads/sample, was sorted and indexed. We used FastQC (v0.11.3, http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to ensure all samples had regular GC content (sum of deviations from normal includes less than 15% of reads), base content per position (difference A vs T and G vs C less than 10% at all positions) and kmer counts (no imbalance of kmers with p < 0.01) as defined by the tool. We augmented Ensembl 75 gene annotations with regulatory region definitions obtained from our ChIP-seq analysis described above, and defined them as present in both genome strands due to their bidirectional transcription potential. For each RNA-seq sample, we quantified expression of genomic and regulatory features in a two-step strand-aware process using HTSeq [76]. For each gene we counted the number of reads that fell exactly within its exonic regions, and did not map to other genomic elements. For each regulatory feature we counted the number of reads that fell exactly within its defined boundaries, and did not map to other genomic or regulatory elements.
By construction, this quantification scheme counts each read at most once towards at most one feature.
Furthermore, strand information is essential to be able to assign reads to features in regions with overlapping annotations. For example, distinguishing intronic eRNAs from pre-mRNA requires reads originating from regulatory activity in the opposite strand from the gene.
Feature counts were transformed into a matrix, and a trimmed mean of M-values normalization was applied to account for library size differences, plus a filter to discard features below an expression threshold of < 0.4 counts per million mapped reads in at least two samples, a rather low cutoff, to allow for regulatory RNAs to enter differential expression calculations. This threshold equates to approximately 15 reads, given our mapped library sizes of ~35 million paired-end reads. A voom normalization was applied to log-transformed counts in order to estimate precision weights per gene, and differential expression estimates were obtained after fitting a linear model on a paired design, using the limma Bioconductor R package [69]. There was a strong correlation (rho=0.81) between microarray and RNA-seq fold change estimates at 4 hours.

Comparison of regRNAs to FANTOM CAGE data
We compared expressed regRNA regions detected in our non-activated CD4 + T cell samples versus those
Calling interactions requires correction for the expected higher density of random collisions at shorter distances [77] which are explicitly modelled by CHICAGO [14] used in this study but not in the ChIA-PET study [15]. As a result, we expected a higher false positive rate from the ChIA-PET data at shorter distances.

Regression of gene expression against PIR count and eRNA expression
We related measures of gene expression (absolute log2 counts or log2 fold change) to numbers of PIRs or numbers of PIRs overlapping specific features using linear regression. We used logistic regression to relate agreement between fold change direction at PCHi-C linked protein coding genes and eRNAs. We used robust clustered variance estimates to account for the shared baits for some interactions across genes with the same prey. Enrichment of chromatin marks in interacting baits and prey were assessed by logistic regression modelling of a binary outcome variable (fragment overlapped specific chromatin peak) against a fragment width and a categorical explanatory variable (whether the HindIII fragment was a bait or prey and the cell state the interaction was identified in), using block bootstrapping of baited fragments (https://github.com/chr1swallace/genomic.autocorr) to account for spatial correlation between neighbouring fragments.

GWAS summary statistics
We used a compendium of 31 GWAS datasets [37] (Additional File 6 :  [78] to compute approximate Bayes factors for each variant using the Wakefield approximation [79], and thus posterior probabilities that each variant was causal as previously proposed [80]. The method assumes a normal prior on the population log relative risk centred at 0, and we set the variance of this distribution to 0.04, equivalent to a 95% belief that the true relative risk is between 0.66 and 1.5 at any causal variant. We set the prior probability that any variant is causal for disease to 10 -4 .

Testing of the enrichment of GWAS summary statistics in PIRs using blockshifter
We used the blockshifter method [37] (https://github.com/ollyburren/CHIGP) to test for a difference between variant posterior probability distributions in HindIII fragments with interactions identified in test and control cell types using the mean posterior probability as a measure of central location. Blockshifter controls for correlation within the GWAS data due to LD and interaction restriction fragment block structure by employing a rotating label technique similar to that described in GoShifter [24]  [83] dataset) or genotyping array (the subset of SNPs also on the Illumina 550k array). Where genotyping density is less than full, we used our proposed poor man's imputation (PMI) strategy to fill in Z scores for missing SNPs.
We ran blockshifter, with 1000 null permutations, for each scenario and PMI condition for 4000 simulated GWAS, with a blockshifter superblock gap size parameter (the number of contiguous non-PIR HindIII fragments allowed within one superblock) of between 1 and 20 and supplying numbers of cases and controls from the RA dataset [51].
For comparison we also investigated the behaviour of a naive test for enrichment for the null scenario. We computed a 2x2 table variants according to test and control PIR overlap, and whether a variant's posterior probability of causality exceeded an arbitrary threshold of 0.01, and Fisher's exact test to test for enrichment.

Enrichment of GWAS summary statistics in CD4 + and activated CD4 + PIRs
We compared the following sets using all GWAS summary statistics, with a superblock gap size of 5 (obtained from simulations above) and 10,000 permutations under the null:- Total CD4 + Activated + Total CD4 + NonActivated (test) versus Endothelial precursors + Megakaryocytes (control)  Total CD4 + Activated (test) versus Total CD4 = NonActivated (control).

Variant posterior probabilities of inclusion, full genotype data (ImmunoChip)
We carried out formal imputation to 1000 Genomes Project EUR data using IMPUTE2 [84] and finemapped causal variants in each of the 179 regions where a minimum p < 0.0001 was observed using a stochastic search method which allows for multiple causal variants in a region, (https://github.com/chr1swallace/GUESSFM) [25]. Despite the pre-selection of regions associating with autoimmune diseases on the ImmunoChip, we chose to again set the prior probability that any variant was causal to 10 -4 , to align our analysis with that applied to the GWAS summary data. The prior probability for individual models follows a binomial distribution, according to the number of causal variants represented, so that the prior for each of the ( n k ) k SNP causal variant models was (10 -4 ) k (1-10 -4 ) (n-k) where n is the number of SNPs considered in the region. The posterior probabilities for models that contained variants which overlapped PIRs for each gene were aggregated to compute PIR-level marginal posterior probabilities of inclusion.

Variant posterior probabilities of inclusion, summary statistics
Where we have only summary statistics of GWAS data already imputed to 1000 Genomes, we divided the genome into linkage disequilibrium blocks of 0.1cM based on data from the HapMap project (http://hapmap.ncbi.nlm.nih.gov/downloads/recombination/2011-01_phaseII_B37/). For each region excluding the MHC we use code modified from Giambartolomei et al. [78] to compute approximate Bayes factors for each variant using the Wakefield approximation [79], and thus posterior probabilities that each variant was causal assuming at most one causal variant per region as previously proposed [80].

Computation of gene prioritisation scores
We used the COGS method [37] (https://github.com/ollyburren/CHIGP) to prioritise genes for further analysis. We assign variants to the first of the following three categories it overlaps for each annotated gene, if any 1.
coding variant: the variant overlaps the location of a coding variant for the target gene. 2.
promoter variant: the variant lies in a region baited for the target gene or adjacent restriction fragment. 3. PIR variant: the variant lies in a region overlapping any PIR interacting with the target gene.
We produced combined gene/category scores by aggregating, within LD blocks, over models with a variant in a given set of PIRs (interacting regions), or over HindIII fragments baited for the gene promoter and immediate neighbours (promoter regions), or over coding variants to generate marginal probabilities of inclusion (MPPI) for each hypothesised group. We combine these probabilities across LD blocks, i, using standard rules of probability to approximate the posterior probability that at least one LD block contains a causal variant: Thus the score takes a value between 0 and 1, with 1 indicating the strongest support. We report all results with score > 0.01 in Additional File 9 : Table S7b, but focus in this manuscript on the subset with scores > 0.5.
Because COGS aggregates over multiple signals, a gene may be prioritised because of many weak signals or few strong signals in interacting regions. To predict the expected information for future users of this method, we considered the subset of 76 input regions with genome-wide significant signals (p<5x10 -8 ) in ImmunoChip datasets. We prioritised at least one gene with a COGS score > 0.5 in 35 regions, with a median of three genes/region (interquartile range, IQR = 1.5-4). Equivalent analysis of the genome-wide significant GWAS signals prioritised a median of two genes/region (interquartile range = 1-3). This suggests that this algorithm might be expected to prioritise at least one gene in about half the genomewide significant regions input when run on a relevant cell type.

Whilst components 1 and 2 are fixed for a given gene and trait the contribution of variants overlapping
PIRs varies depending on the tissue context being examined. We developed a hierarchical heuristic method to ascertain for each target gene which was the mostly likely component and cell state. Firstly for each gene we compute the gene score due to genic variants (components 1 + 2) and variants in PIRs (component 3) using all available tissue interactions for that gene. We use the ratio of the genic score to PIRs score in a similar manner to a Bayes factor to decide whether causal variants contributing to the gene score are more likely to lie within the gene or within its associated PIRs. If a genic location is more likely (gene.score ratio >3) we iterate and compare if the gene score due to coding variants (component 1) is more likely than for promoter variants (component 2). Similarly if PIRs ares more likely we compare PIR gene scores for activated vs non-activated cells. If at any stage no branch is substantially preferred over its competitor (ratio of gene scores < 3) we return the previous set as most likely, otherwise we continue until a single cell state/set is chosen. In this way we can prioritize genes based on the overall score and label as to a likely mechanism for candidate causal variants.

Allele-specific expression assays
Total CD4 + T cells were isolated from five healthy donors and activated as described above and were

Statistical analysis of allele-specific expression data
Sequence data was processed using the Methpup package (https://github.com/ollyburren/Methpup) to extract counts of each allele at rs12722495, and rs12244380 (Additional File 15 : Table S10). Individuals were part of a larger cohort genotyped on the ImmunoChip and were phased using snphap (https://github.com/chr1swallace/snphap) to confirm which allele at each SNP was carried on the same chromosome as A2=rs12722495:C or A1=rs12722495:T. Allelic imbalance was quantified as the ratio A2/A1 and was averaged across replicates within individuals using a geometric mean. Allelic ratios in cDNA and gDNA were compared using Wilcoxon rank sum tests. P values are shown in Fig. 6b and

Consent for publication
Not applicable

Availability of data and materials
The datasets generated and/or analysed during the current study are available in repositories or additional files as indicated below:  PCHi-C data are available as raw sequencing reads from EGA   We examined gene expression using microarray in activated and non-activated CD4 + T cells across 21 hours, and assayed cells in more detail at the four hour time point using ChIP-seq, RNA-seq and PCHi-C. n gives the number of individuals † or pools* assayed. b Eight modules of co-regulated genes were identified, and eigengenes are plotted for each individual (solid lines=activated, dashed lines=nonactivated), with heavy lines showing the average eigengene across individuals. We characterized these modules by gene set enrichment analysis within the MSigDB HALLMARK gene sets, and where significant gene sets were found, up to three are shown per module. n is the number of genes in each module.  PCHi-C interactions and enhancer activity predict change in gene expression. a Change in gene expression at protein coding genes (log2 fold change, y axis) correlates with the number of PIRs gained or lost upon activation (x axis). b Fold change at transcribed sequence within the intergenic subset of regulatory regions ("eRNAs") was more likely to agree with the direction of protein coding gene fold change when the two are linked by PCHi-C (red) in activated CD4 + T cells compared to pairs of eRNA and protein coding genes formed without regard to PCHi-C derived interactions (background, grey, p<10 -4 ). Interactions were categorised as control (present only in megakaryocytes and erythroblasts, our control cells), invariant ("invar"; present in non-activated and activated CD4 + T cells), "loss" (present in non-activated but not activated CD4 + T cells, and significantly down-regulated at FDR<0.01) or "gain" (present in activated but not non-activated CD4 + T cells, and significantly up-regulated at FDR<0.01). c gain or loss of PIRs upon activation predicts change in gene expression, with the estimated effect more pronounced if accompanied by up-or down-regulation at an eRNA. Points show estimated effect on gene expression of each interaction and lines the 95% confidence interval. PIRs categorised as in b. eRNAs categorised as no (undetected), invariant ("invar", detected in non-activated and activated CD4 + T cells, differential expression FDR>=0.01), up (up-regulated; FDR<0.01) or down (down-regulated; FDR<0.01).
Bar plot (top) shows the number of interactions underlying each estimate. Note that eRNA=down, PIR=gain (light gray) has only one observation so no confidence interval can be formed and is shown for completeness only. Fig. 4. An experimental framework for identifying disease causal genes. a Before prioritising genes, enrichment of GWAS signals in PCHi-C interacting regions should be tested to confirm the tissue and context are relevant to disease. Then, probabilistic fine mapping of causal variants from the GWAS data can be integrated with the interaction data to prioritise candidate disease causal genes, a list which can be iteratively filtered using genomic datasets to focus on (differentially) expressed genes and variants which overlap regions of open or active chromatin. b Autoimmune disease GWAS signals are enriched in PIRs in CD4 + T cells generally compared to control cells (blockshifter Z score, x axis) and in PIRs in activated compared to non-activated CD4+ T cells (blockshifter Z score, y axis). Text labels correspond to datasets described in Additional File 6 : Table S5. c Genes were prioritised with a COGS score>0.5 across five autoimmune diseases using genome-wide (GWAS) or targeted genotyping array (ImmunoChip) data. The numbers at each node give the number of genes prioritised at that level. Where there is evidence to split into one of two non-overlapping hypotheses (log10 ratio of gene scores>3), the genes cascade down the tree. Act and NonAct correspond to gene scores derived using PCHi-C data only in activated or nonactivated cells, respectively. Where the evidence does not confidently predict which of the two possibilities is more likely, genes are "stuck" at the parent node (number given in brackets). When the same gene was prioritised for multiple diseases, we assigned fractional counts to each node, defined as the proportion of the n diseases for which the gene was prioritised at that node. Because of duplicate results between GWAS and ImmunoChip datasets, the total number of prioritised genes is 252 (see Table  1).  Fig. 6. PCHi-C interactions link the IL2RA promoter to autoimmune disease associated genetic variation, which leads to expression differences in IL2RA mRNA. a The ruler shows chromosome location, with HindIII sites marked by ticks. The top tracks show PIRs for prioritised genes in non-activated (n) and activated (a) CD4 + T cells. Green and purple lines are used to highlight those PIRs containing credible SNPs for the autoimmune diseases T1D and MS fine mapped on chromosome 10p15. Six groups of SNPs (A-F) highlighted in Wallace et al. [25] are shown, although note that group B was found unlikely to be causal. The total number of interacting fragments per PCHi-C bait is indicated in parentheses for each gene in each activation state. Dark grey boxes indicate promoter fragments; light grey boxes, PIRs containing no disease associated variants; and coloured boxes, PIRs overlapping fine mapped disease associated variants. PCHi-C interactions link a region overlapping group A in non-activated and activated CD4+ T cells to the IL2RA promoter (dark grey box) and regions overlapping groups D and F in activated CD4+ T cells only. RNA-seq reads (log2 scale, red=forward strand, blue=reverse strand) highlight the upregulation of IL2RA expression upon activation and concomitant increases in H3K27ac (non-activated, n, green line; activated, a, purple line) in the regions linked to the IL2RA promoter. Red vertical lines mark the positions of the group A SNPs. Numbers in parentheses show the total number of IL2RA PIRs detected in each state. Here we show those PIRs proximal to the IL2RA promoter. Comprehensive interaction data can be viewed at \url{http://www.chicp.org. b Allelic imbalance in mRNA expression in total CD4+ T cells from individuals heterozygous for group A SNPs using rs12722495 as a reporter SNP in non-activated (non) and activated (act) CD4 + T cells cultured for 2 or 4 hours, compared to genomic DNA (gDNA, expected ratio=1). Allelic ratio is defined as the ratio of counts of T to C alleles. 'x'=geometric mean of the allelic ratio over 2-3 replicates within each of 4-5 individuals, and p values from a Wilcoxon rank sum test comparing cDNA to gDNA are shown. `+' shows the geometric mean allelic ratio over all individuals. c Allelic imbalance in mRNA expression in memory CD4 + T cells differs between ex vivo (time 0) and four hour activated samples from eight individuals heterozygous for group A SNPs using rs12722495 as a reporter SNP. p value from a paired Wilcoxon signed rank test is shown.

Additional Files
 Additional File 1: stable1-gene-modules.csv (CVS) Table S1 Gene Modules inferred from WGCNA analysis of microarray timecourse. Expression fold changes and associated false discovery rates (adjusted p values) are from RNA-seq data at the 4 hour timepoint.  Additional File 2: supplementary-figures.pdf (PDF) Figures S1-S15  Additional File 3: stable2-rnaseq.csv (CSV) Table S2 Results of differential expression analysis on RNA-seq data. Features are defined in the GTF file in Additional File 11 : Table S8a.