Dissection of multiple sclerosis genetics identifies B and CD4+ T cells as driver cell subsets

Background Multiple sclerosis (MS) is an autoimmune condition of the central nervous system with a well-characterized genetic background. Prior analyses of MS genetics have identified broad enrichments across peripheral immune cells, yet the driver immune subsets are unclear. Results We utilize chromatin accessibility data across hematopoietic cells to identify cell type-specific enrichments of MS genetic signals. We find that CD4 T and B cells are independently enriched for MS genetics and further refine the driver subsets to Th17 and memory B cells, respectively. We replicate our findings in data from untreated and treated MS patients and find that immunomodulatory treatments suppress chromatin accessibility at driver cell types. Integration of statistical fine-mapping and chromatin interactions nominate numerous putative causal genes, illustrating complex interplay between shared and cell-specific genes. Conclusions Overall, our study finds that open chromatin regions in CD4 T cells and B cells independently drive MS genetic signals. Our study highlights how careful integration of genetics and epigenetics can provide fine-scale insights into causal cell types and nominate new genes and pathways for disease. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-022-02694-y.

therapies targeting B cells have been demonstrated to be remarkably effective in treating patients with MS [6,7].
MS has a strong genetic component and is characterized by a polygenic architecture. To date, over 200 independent genetic variants have been associated with MS risk, the vast majority of which are common variants with small effect sizes on disease risk [8,9]. Prior studies have shown enrichment of GWAS target genes in the peripheral immune system, but it is unclear exactly which cell types within the peripheral immune system drive these observed enrichments of genetic signals.
Human genetics has emerged as a powerful tool for probing the underlying biology of a disease [10]. The identification of genes and pathways prioritized by GWAS associations is not constrained by our prior knowledge of disease mechanisms and can therefore identify novel biological mechanisms. However, a key challenge for translating GWAS findings into biological insights is that most associations are noncoding in nature and likely act by modulating regulatory elements to mediate gene expression [11,12]. Identifying the causal gene at these GWAS signals can be challenging since it is usually unclear which gene a given regulatory element regulates [10,13,14]. A key step in translating genetic associations into biological mechanisms is identifying the cell types in which GWAS variants act and the genes they modulate. To help understand the function of disease-associated variants, genetic associations can be intersected with orthogonal epigenetic and gene expression data [12]. As the epigenetic and gene expression landscape differ from cell type to cell type, examining the enrichments of GWAS data on these orthogonal datasets can identify specific cell types that may be implicated in disease pathogenesis [15,16].
We and others have previously reported a strong enrichment of MS GWAS variants in regulatory regions of multiple cell types of the peripheral immune system [8,9,15,17]. However, it has yet to be determined if these enrichments are driven by shared regulatory mechanisms common to many immune cell types, or whether different mechanisms are present in distinct immune cell populations. To address this gap, we performed detailed analyses of the enrichment of MS GWAS variants in the peripheral immune system to identify cell populations that independently mediate the effects of MS GWAS variants on disease risk.

MS GWAS associations are enriched in progenitor and terminal peripheral immune cells
To identify the causal cell types that uniquely and independently contribute to MS pathogenesis via mediation of genetic effects, we leveraged bulk ATAC-seq data from 16 flow-sorted hematopoietic progenitor and terminal cell populations isolated from human peripheral blood or bone marrow [18][19][20]. These cells represent progenitor and terminal populations from across the hematopoietic tree, enabling investigation of MS GWAS enrichments broadly and across stem, progenitor, and mature cell populations (Fig. 1A). ATAC-seq data were processed, and open chromatin regions (OCRs, i.e., ATAC-seq peaks) were identified as previously described [18,21]. We applied stratified LD SCore regression (LDSC) [16,22] to estimate the enrichment of MS GWAS in OCRs from each of the 16 hematopoietic cell types. LDSC has the distinct advantage in that it leverages the genome-wide polygenic signal in the GWAS summary statistics rather than selecting variants based on p-value thresholds or fine-mapping posterior probabilities.
Applying LDSC, we observed strong statistically significant enrichments across all hematopoietic cell populations, even after correcting for multiple hypothesis testing (Bonferroni-corrected p-value threshold of 3.13×10 −3 ) ( Fig. 1B; Additional file 1, Table  S1). The strongest enrichments for MS GWAS were observed in OCRs from CD4 T cells (enrichment p-value = 1.47×10 −18 ), CD8 T cells (p-value = 4.00×10 −18 ), and B cells (p-value = 3.27×10 −15 ); reflecting their known and emerging roles in MS pathogenesis and as targets of treatment [1,3,23]. We also detected strong enrichment in OCRs from natural killer (NK) cells (p-value =4.23×10 −14 ) which have a less well-established role in MS [23]. Interestingly, we observed enrichments across all progenitor cells, suggesting that many MS genetic associations are located in regulatory regions involved in core cellular processes in immune cell populations.

CD4 T and B cell regulatory regions independently mediate MS genetics
Many of the studied cell populations share common cellular regulatory signatures, which is reflected in the substantial correlation of the OCR profiles across cell populations (Additional file 2, Fig. S1). Hence, we examined whether the strong enrichment observed across cell populations is a result of truly independent cell type-specific enrichments or whether it is due to shared regulatory landscape across immune cell types. To address this question, we applied a joint model in LDSC to measure the contribution of OCRs from a given cell type, stratified on all other cell types in the model along Fig. 1 A Experimental design. Top box shows the hematopoietic cell types analyzed. MS discovery GWAS results were integrated with ATAC-seq profiles generated from the hematopoietic cell types. LDSC was performed to evaluate enrichment of MS GWAS in the OCRs of each hematopoietic cell type. Statistical fine-mapping was also performed on the MS GWAS results, which were then integrated with orthogonal epigenetic data such as promoter capture HiC interactions. This integration of fine-mapping and epigenetic data allowed for identification of putative causal mechanisms at individual loci. B Enrichment of MS GWAS heritability in hematopoietic cell OCRs. Enrichment p-values are shown as −log 10 (p-value) with a set of baseline annotations. We report the p-value of the coefficient τ c , which reflects the SNP heritability of a given annotation stratified on all other annotations in the model. In this joint model where OCRs from all 16 cell types were included, we observed that B cells and CD4 T cells contributed significantly to SNP heritability (coefficient p-value = 3.99×10 −5 and 3.49×10 −4 , respectively), suggesting independent contributions of B and CD4 T cell OCRs to MS GWAS heritability ( Fig. 2A, Additional file 1, Table S2).
To further delineate the cell types with OCRs that specifically mediate the effect of MS GWAS results, we performed a series of pairwise LDSC analyses. In brief, OCRs of a given hematopoietic cell type were stratified against the OCRs of each of the other 15 cell types, as well as the LDSC baseline annotations ( Fig. 2B; Additional file 1, Table  S3). As above with the joint model, we report the p-value of the coefficient τ c . We observed that B cells remained significant even after stratifying on OCRs of any of the other 15 cell populations (coefficient p-value ranging from 1.47×10 −12 when stratifying against HSCs, to 8.33×10 −5 when stratifying against CD4 T cells). This was also the case for CD4 T cells, which remained significant after stratifying on OCRs from any of the other 15 cell populations (coefficient p-values ranging from 3.82×10 −17 when stratifying against HSCs, to 2.39×10 −4 when stratifying against CD8 T cells). In contrast, CD8 T cell OCRs were no longer significant after stratifying against OCRs from CD4 + T cells (coefficient p-value = 0.21), though they were significant when stratifying on any of the other cell populations. NK cells were also no longer significant after stratifying against either CD4 T cells (coefficient p-value = 0.165) or against CD8 T cells (coefficient p-value 0.356). These results indicate that the enrichment of both CD8 T cells and NK cells can be largely explained by shared regulatory landscapes that are also Fig. 2 A LDSC enrichment results for MS GWAS enrichment in OCRs from across hematopoietic cell types in a joint model. Heights of the circles reflect LDSC coefficient (τ c ) p-values, which measures whether the annotation (i.e., OCRs for a given cell type) contributes significantly to SNP heritability in an overall model that includes OCRs for all hematopoietic cell types and baseline annotations. Sizes of the circles are proportional to the enrichment p-values for that given cell type, with larger circles reflecting more significant p-values. B LDSC enrichment p-values for pairwise stratified LDSC of MS GWAS results in OCRs from hematopoietic cell types. Y-axis are the index cell types with LDSC enrichment p-values prior to stratifying in parentheses. X-axis shows the comparator cell type being conditioned upon. Boxes are shaded by the LDSC coefficient p-values for the index cell type after conditioning on the comparator cell type in the pairwise model (with darker colors representing stronger enrichments). Red stars indicate pairwise comparisons that are statistically significant a Bonferroni-corrected p-value threshold of 2.2×10 −4 present in CD4 T cells. Prior studies have also suggested a role for monocytes in MS [23,24]. OCRs from monocytes had an enrichment p-value of 4.17×10 −9 , but we observed that stratifying on OCRs from CD4 T cells or B cells ameliorated this monocyte heritability enrichment (coefficient p-values 0.166 and 0.266, respectively).
We also performed a separate analysis examining whether OCRs specific to a given cell type mediate MS heritability enrichments. For each of the mature hematopoietic cell populations, we performed LDSC on only the cell type-specific OCRs, i.e., ATACseq peaks present in only that cell type. B cells exhibited a statistically significant enrichment of cell-specific OCRs for MS GWAS (enrichment p-value = 3.27×10 −4 ). CD4 T cell-specific peaks were nominally significant at a p-value of 0.013 but did not survive correction for multiple hypothesis testing (Bonferroni-corrected p-value threshold of 5.6×10 −3 ). Cell type-specific peaks for all other terminal hematopoietic cell types had enrichment p-values > 0.05 (Additional file 2, Fig. S2; Additional file 1, Table S4).

MS genetic associations are mediated in terminal immune cell populations
In the lymphoid lineage, we observed stronger enrichments in terminal cell populations than we did for the progenitor populations (Fig. 1B). For example, the strongest enrichment in progenitor cells was observed for common lymphoid progenitor cells (CLP, enrichment p-value 2.32×10 −4 ), and it was orders of magnitude less statistically significant compared to the enrichment observed for CD4 T or B cells (Fig. 1B). For each of the terminal populations, the significance remained even after stratifying against OCRs from CLP; however, the converse was not true (Fig. 2B). The enrichment in CLP was completely ameliorated by stratifying against B cell or CD4 T cell OCRs (coefficient pvalue 0.98 and 0.966, respectively, Fig. 2B). Together, these results suggest that terminal cells of the lymphoid compartment retain cellular regulatory features from their progenitor populations that are important for MS pathogenesis, but have also developed specific regulatory features of additional importance to MS susceptibility.

Comparison of immune cell enrichment with neuropsychiatric and autoimmune disorders
We next sought to understand how the immune cell enrichments in MS might be similar or different from those of other autoimmune or neuropsychiatric disorders. To test this, we calculated heritability enrichment within these 16 hematopoietic OCRs for GWAS of various other neuropsychiatric disorders and autoimmune disorders: Alzheimer disease (AD) [25], schizophrenia (SCZ) [26], bipolar disorder (BPD) [27], type 1 diabetes (T1D) [28], Crohn's disease (CD) [29], ulcerative colitis (UC) [29], systemic lupus erythematosus (SLE) [30], rheumatoid arthritis (RA) [31], and primary biliary cirrhosis (PBC) [32] (Additional file 2, Fig. S3A; Additional file 1, Table S5). We identified previously recognized cell-type enrichments across these other diseases, such as enrichments in OCRs from B cells (enrichment p-value 1.93×10 −7 ), CD4 T cells (p-value 7.34×10 −8 ), and CD8 T cells (p-value 2.09×10 −6 ) for RA [15,31]. However, the heritability enrichments for OCRs from these hematopoietic cell populations tended to be much stronger for MS, despite similar sample sizes, e.g., 41,505 disease cases for MS and 38,242 disease cases for RA. To parse out cell type-specific enrichments in these other disorders, we also tested heritability enrichments under the joint model in LDSC by including OCRs from all 16 cell types (Additional file 2, Fig. S3B, Additional file 1, Table S6). In a similar fashion to the MS GWAS joint analyses above, we measured the contribution to heritability for a given set of OCRs of interest, controlling for the effects of a set of baseline annotations and OCRs from all other hematopoietic cell types. Across the nine other comparator diseases, the only other statistically significant stratified enrichments were in OCRs from B cells in SLE GWAS (coefficient p-value 8.53×10 −5 ). These results highlight the striking heritability enrichments in OCRs from CD4 T and B cells in the MS GWAS.

The CD4 T cell MS GWAS enrichment is driven by the Th17 subset
To further focus in on the MS GWAS enrichment observed in OCRs from CD4 T cells, we utilized ATAC-seq data from various subsets of human CD4 T cells (Fig. 3A) [21]. We examined data from effector CD4 T cells (naïve effector CD4 T cells, T h 1, T h 2, T h 17, and follicular T h ) as well as regulatory CD4 T cells (naïve T regs and memory T regs ). We observed strong enrichments for OCRs from both effector and regulatory CD4 T cell populations ( Fig. 3B; Additional file 1, Table S7). Next, to identify the independent contribution of a given cell type, we applied the joint model in LDSC by including OCRs from all CD4 T cell populations together. This joint LDSC analysis revealed that OCRs from T h 17 cells independently contributed to heritability (coefficient p-value = 4.69×10 −4 ; Fig. 3C; Additional file 1, Table S8). These results suggest that among CD4 T cells, OCRs from T h 17 cells drive the signal for enrichment in MS GWAS.

Memory subpopulations explain the enrichment of MS GWAS in B cells
We next examined enrichments of MS GWAS data in ATAC-seq from the B cell lineage including naïve B cells, memory B cells, and plasmablasts ( Fig. 4A) [21]. We found that OCRs from all B cell lineage cell types were significantly enriched for MS heritability ( Fig. 4B; Additional file 1, Table S10) Table S11). These results suggest that in the B cell lineage, the MS GWAS enrichment signal is driven by OCRs in memory B cells.

Immune cells from MS patients reinforce independent CD4 T and B cell enrichments
Next, we tested whether the MS GWAS enrichment in CD4+ T and B cells were also present in OCRs from the respective immune cell types derived from individuals with MS. We utilized ATAC-seq data performed in flow-sorted bulk CD4 T and B cell subsets (n = 6) derived from six patients with MS who were not treated with immunomodulatory therapy within at least 4 months of sample collection (see Additional file 1, Table S13 for clinical details). LDSC showed statistically significant enrichments of transitional B cells (traB; enrichment p-value = 2.08×10 −3 ), class switched classical memory B cells (cMBc; p-value = 2.58×10 −4 ), effector memory CD4 T cells (T4 em ; pvalue = 1.87×10 −4 ), and CD45RA+ effector memory CD4 T cells (T4ra; p-value = 3.02×10 −4 ). Central memory CD4 T cells had an enrichment p-value of 1.53×10 −3 , which was not significant after correcting for multiple hypothesis testing ( Fig. 5A; Additional file 1, Table S14).
To further identify independent cell type enrichments, we performed joint models in LDSC as described above. We first tested a model that included CD4 T cell subsets (T4nv, T4cm, T4em, and T4ra). In this joint model, only T4em was statistically significant (coefficient p-value 7.45 × 10 −3 ), reflecting the independent contribution of effector CD4 T cells in MS that we observed above using cells from healthy individuals ( Fig. 5B; Additional file 1, Table S15). We also ran a model that included B cell subsets (traB and cMBc). In this joint comparison, neither cell type was statistically significant when correcting for multiple hypothesis testing. However, cMBc had a coefficient p-value that was nominally significant (p-value = 0.037), reiterating the independent contributions of mature B cell types ( Fig. 5C; Additional file 1, Table S16).
Immunomodulatory treatments suppress mediation of MS genetics in cell-specific fashion Next, we tested whether immunomodulatory treatments alter the cell-specific mediation of MS genetic associations by utilizing data for the same immune subsets sorted  Table S17), though the magnitude of the enrichments were attenuated relative to the signals observed from cells from untreated patients (compare Fig. 5A, D). To better understand this attenuation of enrichments, we ran joint models in LDSC for T4em and cBMc cells in which we included OCRs from untreated patients and treated patients. In a joint model with T4em OCRs from treated and untreated MS patients, only OCRs from untreated patients were statistically significant (Additional file 2, Fig. S4A; Additional file 1, Table S18). Similarly, in a joint model with cMBc OCRs from treated and untreated MS patients, only OCRs from untreated patients were statistically significant (Additional file 2, Fig. S4B; Additional file 1, Table S19). Together, these results suggest that immune-modulating therapies may attenuate the chromatin accessibility signals at MS GWAS.
To further delineate the chromatin states with the strongest MS genetic associations, we examined enrichments of MS GWAS results in the predicted chromatin states for B cells and T h 17 T cells as available in the RoadMap Epigenomics Project [35]. For T h 17 CD4 T cells, the "EnhA2" chromatin state (Active Enhancer 2) were statistically enriched (enrichment p-value = 1.19 × 10 −3 ) (Additional file 2, Fig. S6A; Additional file 1, Table S22). For B cells, "Tx3" (Transcribed 3' preferential; enrichment p-value = 1.80×10 −3 ) and "PromD1" (Promoter Downstream TSS 1; p-value=4.40×10 −4 ) were statistically enriched, again reflecting the strongest enrichments at active regulatory elements (Additional file 2, Fig. S6B; Additional file 1, Table S23). These results demonstrate that MS GWAS variants act through activating regulatory elements, consistent with the autoimmune nature of MS.

Fine-mapping of MS GWAS loci in cell-specific OCRs
We next sought to understand underlying mechanisms by nominating putative causal genes and variants in a cell-specific fashion. First, we applied statistical fine-mapping to nominate likely causal SNPs. Most statistical fine-mapping approaches require association information across all SNPs in a given locus. In contrast, for MS GWAS, genome-wide results are based on targeted replication analyses, which by design included only a select subset of SNPs within each locus. To overcome this challenge, we applied PICS to perform statistical fine-mapping, which, as compared to other statistical fine-mapping approaches, does not require GWAS summary statistics for all SNPs in a region [15]. We defined for each locus a 95% credible set such that the sum of the posterior probabilities for variants in that credible set is greater than or equal to 95%. For fine-mapping, we used the joint analysis MS GWAS results, which include only replicated genome-wide effects [8]. We included all 200 non-MHC loci where the MS GWAS joint association p-value was less than 5×10 −8 . Next, we prioritized SNPs if they had a PICS posterior probability (PP) > 1% and were included in the 95% PICS credible set. This is a liberal threshold for defining prioritized SNPs, aimed at increasing sensitivity for detecting possible causal variants. Across the 200 loci, there were 3436 credible set variants (1-58 variants per locus) (Additional file 2, Fig. S7A). At 19 loci, there was only one variant in the credible set, and at 37 loci, there were four or fewer prioritized variants (Additional file 2, Fig. S7B).
Next, we intersected the 3436 credible set variants with OCRs from the 16 hematopoietic cell populations, which we chose to use as they cover a broad range of hematopoietic cell types [18]. Across the 200 loci, 870 of the prioritized variants overlapped an OCR in at least one cell type. Remarkably, 163 out of the 200 loci had at least one prioritized variant overlapping an OCR in any cell type. B and CD4 T cells were the cell types with the greatest number of loci with a credible set SNP overlapping an OCR, 126 and 125 loci respectively (Additional file 2, Fig. S8). We further examined the statistical overlap of the PICS credible set variants with OCRs from CD4 T cell and B cell subpopulations. Among B cells, we found that memory B cells had the strongest enrichment (1.51 fold; p-value 4.33×10 −16 ) and remained statistically significant when examining only OCRs present specifically in memory B cells (Additional file 1, Table  S24). Among CD4 T cells, we found that T h 17 cells had the strongest enrichment (1.72 fold; p-value 2.21 × 10 −26 ) and also remained statistically significant when examining only OCRs present specifically in T h 17 cells (Additional file 1, Table S25). These results confirm that statistical fine-mapped variants highlight memory B cell and T h 17 T cell populations and that statistical fine-mapping results are useful for further functional prioritization.

Integration of genetic and epigenetic data identifies putative causal genes
The vast majority of GWAS-associated variants are noncoding and regulate genes that may be far from the variant in terms of linear distance on the genome. To address this challenge and nominate putative causal genes that are regulated by the MS-associated OCRs, we leveraged promoter capture Hi-C data (PCHiC), which identifies chromatin looping interactions between regulatory elements and target gene promoters. We utilized PCHiC data performed on 17 hematopoietic cell populations, which partially overlap with the cell types for which we have ATAC-seq data [36]. These cell populations include naïve B cells, total B cells, activated CD4 T cells, non-activated CD4 T cells, and total T cells. We considered only GWAS loci where a credible set MS GWAS SNP overlapped both an OCR and a PCHiC interaction. We examined the statistical enrichment of PICS credible set variants and PCHiC loop regions that overlap an OCR. We found that among all cell types, PCHiC loops from naïve B (enrichment p-value 2.09 × 10 −11 ) and activated CD4 T cells (p-value 1.77 × 10 −14 ) had the strongest statistical enrichments (Additional file 1, Table S26). This result confirms that PCHiC from B cells and CD4 T cells are highly enriched for statistically fine-mapped variants and may be useful for further prioritization of variants.
Having demonstrated that PCHiC loops and OCRs map onto MS GWAS variants, we next asked whether eQTLs would colocalize SNPs overlapping these epigenetic features. We examined whether the MS GWAS loci that overlap OCR and loops colocalize more often with cell-specific expression quantitative trail locus (eQTL) studies than expected by chance. We performed colocalization [37] using CD4 T and B cell-specific immune cis-eQTL results from DICE [38]. We identified a small number of colocalized loci with eQTL eGenes, ranging from 15 SNP-eGenes pairs for stimulated CD4 T cells to 34 for T h 17 cells, of which about one third overlapped the previously identified OCR+loops per cell type (Additional file 2, Fig. S9). Naïve Tregs and T h 17 exhibited the strongest enrichment of colocalized loci that overlap OCR and loops with follicular T h , memory Treg, and T h 1/17 also passing the Bonferroni-corrected enrichment threshold (Additional file 2, Fig. S10). We note that we used a less stringent threshold of a colocalization posterior probability of 0.6 for these enrichment analyses to increase the number of observations for examining OCR overlap.
To identify putative causal genes in an unbiased fashion, we integrated the MS finemapping data with the OCR and PCHiC loops to nominate 261 genes within 86 MS loci in B cells and 364 genes within 115 MS loci in CD4 T cells (Additional file 1, Table  S27). We note that these genes are putative causal genes, representing a list of genes that could be linked with the MS loci via a regulatory mechanism in the respective cell types. The majority of these genes were shared between B and CD4 T cells (n = 178; 68.2% and 48.9% respectively; Fig. 6A). We have previously suggested a list of putative causal genes (n = 551) based on an ensemble of methods that did not include ATACseq or chromatin interactions [8]. Of these 551 previously nominated genes, 67 (12.2%) overlapped with our B cell prioritized genes and 111 (20.1%) with our CD4 T cell prioritized genes, highlighting that our current mechanism-specific gene prioritization is capturing a large number of potentially causal genes that have not been previously implicated in MS genetic studies.
Next, we utilized the list of putative causal genes to identify enriched canonical pathways. Starting with CD4 T and B cell lists, we created additional lists for the common genes (shared between CD4 T and B cells), and finally genes unique to CD4 T cells and B cells. We observed widespread pathway enrichment for the putative causal genes of the CD4 T cells (n = 294) and B cells (n = 236) at FDR<5%. The common set of genes was enriched in 85 pathways (Additional file 1, Table S28). The B unique gene list (n = 83) was enriched in 22 canonical pathways, including lipoprotein and cholesterol pathways, the CD40 pathway, and JAK-STAT pathway (Fig. 6B, C). The unique genes in CD4 T cells (n = 186) were enriched in 99 pathways, including TCR pathways, various interleukin pathways, and MAPK/ERK signaling pathways (Fig. 6B, C).
Pathway analyses utilize known biological connections for a given set of genes but many of underlying mechanisms could be still uncharacterized. Thus, we leveraged protein-protein interaction (PPI) data to test whether the respective putative causal gene lists exhibit a high degree of connectivit y [39]. A similar percent of the mapped CD4 T cell and B cell prioritized genes were directly connected, 48.9% and 43.7% respectively (Additional file 1, Table S29; Additional file 2, Figs. S11-15). Only the CD4 T and CD4 T unique gene lists demonstrated a higher degree of connectivity than expected (p-value<0.05), although all gene lists had communities of genes with high connectivity (p-value<0.05; Additional file 1, Table S29). These results are consistent with the pathway analyses, implying that the MS genetics are mediated by several different mechanisms in both cell types, some of which are shared and some of which are celltype specific.

A fine interplay between shared and cell-specific B and CD4 T cell putative causal genes
The results of the pathways and PPI analyses suggest that the MS genes common and unique to B and CD4 T cells do not act independently but rather have shared cellular mechanisms. We illustrate this by studying a locus on chromosome 19 (Fig. 7) where the lead SNP rs1465697 (chr19:49837246 C>T) has a MS GWAS association p-value of 3.02×10 −18 . The lead SNP is the most highly prioritized variant by statistical finemapping out of 21 overall SNPs in the 95% credible set (PP=15% for rs1465697; next highest PP 9%). This SNP overlies an OCR present in all lymphoid lineages, including B cells, CD4 T cells, and CD8 T cells. We have previously suggested five putative causal genes for this locus: DKKL1, CCDC155, CD37, TEAD2, and SLC6A16 [8]. Using PCHiC data, this OCR forms a chromatin loop interaction with TEAD2 and DKKL1, but not the other three genes. This chromatin loop interaction is observed in activated CD4 T cells, naïve B cells, total B cells, naïve CD8 T cells, and fetal thymus, but none of the other hematopoietic cell types [36]. Furthermore, this SNP is an eQTL for TEAD2 in B cells, but is not an eQTL for any other gene in this locus in any of the available hematopoietic cell types (Fig. 7) [38]. Together, these lines of evidence support TEAD2 as the causal gene at this locus.
Interestingly, TEAD2 is a transcription factor with 1459 predicted regulated genes [40], including 38 putatively causal CD4 T cell genes and 25 B cell genes as nominated above (FDR < 1 %, FDR < 1%, respectively; Additional file 2, Figs. S16-17). The majority of these genes are common in both CD4 T and B cells (n = 23; FDR < 1%; Additional file 2, Fig. S18). To identify genes whose expression is modulated by TEAD2 in CD4 T and B cells, we examined TEAD2 knockdown (KD) and over-expression (OE) in cancer cell lines (n = 8, respectively) from the Library of Integrated Network-Based Cellular Signatures (LINCS) Program [41]. Although these cell lines do not represent an ideal experimental model to study the effect of TEAD2 in immune cells, they can still be used to understand mechanisms reflecting core cellular functions. Within each cell line, we identified genes whose expression changed in the opposite direction (top or bottom 10%) in the KD versus OE models. These ranged from 7 to 24 for the CD4 T cell prioritized genes (Additional file 1, Table S30; Additional file 2, Fig. S19) and from 3 to 16 for the B cell genes (Additional file 1, Table S31; Additional file 2, Fig. S23). Of these, 17 genes and 9 genes changed expression in at least 2 cell lines, respectively (Additional file 1, Table S30-31). These data demonstrate that perturbation of TEAD2, a key immune cell transcription factor, results in indirect changes of putative MS causal genes in B and CD4 T cells.

Discussion
In this paper, we integrated MS GWAS with chromatin accessibility data from a broad array of peripheral immune cells in order to identify putative causal cell types. Our analyses identified regulatory regions in B cells and CD4 T cells as each being independently enriched for MS genetics. Within the CD4 T cell and B cell populations, we further identified OCRs from T h 17 cells and memory B cells as specifically driving their respective enrichments. Chromatin data from MS patients reiterated these findings and further suggested that immunomodulatory treatments alter the chromatin accessibility overlying MS-associated GWAS variants. Integration of PCHiC data led to prioritization of putative causal genes in B and CD4 T cells, identifying target genes that are both shared and specific to each of these cell populations. The putative causal genes implicate several known signaling pathways, mostly due to the cell-specific MSassociated genes, despite these representing a smaller percentage compared to genes shared between B and CD4 T cells. Finally, we illustrate that the B and CD4 T cell mechanisms are intertwined by describing how TEAD2, a putative causal gene shared between B and CD4 T cells, contributes to disease susceptibility directly and indirectly by targeting both shared and cell-specific MS genes.
Our study provides genetic evidence for the independent involvement of both CD4 T and B cells in the pathogenesis of MS [1][2][3]. It further supports the long-debated causal role of memory B cells in MS [23,42,43], shown by the highly effective therapies targeting B cell receptors, most notably ocrelizumab [7]. Through our analyses, we also corroborate decades of research that have demonstrated a primary role of CD4 T cells in MS, including the importance of T h 17 T cells [23,44,45]. For both the B cell and T h 17 T cell populations, we find that active enhancers and promoters drive the enrichment signal, consistent with activating roles of these cell types in MS as an autoimmune condition. The complex interplay between the B and CD4 cells in MS pathogenesis [2,3,42] is also reflected by our finding of some OCRs present in both cell types, while other OCRs are present only in specific cell populations.
Although other peripheral immune cell types have been shown to be involved in MS, including monocytes, CD8 T cells, and mDCs [23], we did not identify independent enrichments for these cell types. One possible explanation for this apparent contradiction is that these cell types might work secondarily to memory B and T h 17 cells, which are more directly under influence from MS GWAS-associated variants. Non-genetic effects, e.g. environment-specific response, could also explain their role in MS above and beyond any shared mechanisms with B and CD4 T cells. Further, context-specific studies, such as under various cell activation conditions, would be necessary to unravel any potential independent influence of these cells by MS genetic variants. Another potential caveat is that we utilized bulk sorted cell populations, which may miss subpopulations of cells that bear chromatin signatures that mediate independent effects of MS genetics. One approach to overcome this is using single-cell epigenetic approaches, which will be an important area of future exploration. Lastly, while our analyses do not identify an independent genome-wide enrichment for cell types other than B and CD4 T cells, GWAS variants may still act at individual loci in these other cell types.
One of the key challenges of GWAS is moving from genetic association to biological mechanisms [10,13,14]. This is driven by three main challenges. First, linkage disequilibrium, while highly advantageous to discovery of genetic associations, limits our ability to identify the causal variant. Second, as most GWAS variants are noncoding, identifying the gene(s) that are affected by the causal variant can be difficult. Third, the cell type(s) in which a given associated variant acts can be unclear. Our study demonstrates how we can use statistical fine-mapping to help solve the first challenge, though this is not without multiple caveats [46]. We integrated orthogonal datasets (ATACseq, PCHiC, and eQTL data) to help delineate the likely causal genes and cell types and overcome the latter two challenges. We document how shared and cell-specific genes affect putative causal pathways. We further illustrate the complex interplay between shared and cell-specific putative causal MS genes by studying the TEAD2 locus, a transcription factor recently implicated in immune regulation [47].

Conclusions
Together, our study generates important insights into the driver subpopulations of peripheral immune cells in MS, reinforcing how MS genetics act primarily through B and CD4 T cells. Our study also demonstrates the need for in-depth context-specific cellular data to carefully delineate the causal role of each immune cell subset in MS.

Epigenetic and eQTL datasets
Hematopoietic progenitor and terminal ATAC-seq data ATAC-seq data for 16 different human hematopoietic progenitor and terminal populations was obtained from Corces et al. [18] and Buenrostro et al. [19]. These ATAC-seq profiles were generated on bulk FACS-sorted cells from human peripheral blood or bone marrow cells. Alignment of the ATAC-seq data and peak-calling were performed as previously described [20]. To identify cell-type-specific peaks for each cell type, we used ATAC-seq peaks for that cell type and removed any peaks that overlapped with a peak present in any one of the other 15 cell types. A single base pair overlap was considered to be overlapping.

Immune cell ATAC-seq data
We used publicly available immune cell ATAC-seq data (NCBI GEO GSE 118189) derived from flow-sorted peripheral blood cells [21]. As each cell type had between 1-4 human donors, we merged the raw ATAC-seq data from the individual donors for a given cell type. We aligned ATAC-seq reads using bowtie2 version 2.2. 1 [48] with default parameters and a maximum paired-end insert distance of 2000 base pairs. The bowtie2 index was constructed with the default parameters for the hg19 reference genome. We filtered out reads that mapped to the mitochondria and used samtools version 1.10 [49] to filter out reads with MAPQ < 30 and with the flags "-F 1804" and "-f 2." Additionally, duplicate reads were discarded using picard version 2.20.6 (http:// broadinstitute.org.github.io/picard). Finally, chromatin accessibility peaks were identified with MACS2 version 2.1.1 [50] under default parameters and "--nomodel --nolambda --keep-dup all --call-summits."

ChIP-seq data
We downloaded available pre-processed ChIP-seq peak calls from ENCODE for B cells [33]. Where replicates were available, the bed files for the replicates were merged to create a composite set of peaks for each histone mark. Data for T h 17 histone ChIP-seq were downloaded from the Roadmap Epigenomics Project [34] (NCBI GEO GSM997225); we used pre-processed ChIP-seq peak calls generated in Amariuta et al. [51].

chromHMM
We used a 25 chromatin state model [35], which are imputed based on 12 epigenetic marks from across 127 epigenomes generated as part of the Roadmap Epigenomics Project [34]. We used the chromatin states from B cells and Th17 CD4+ T cells. Chromatin states were downloaded from https://egg2.wustl.edu/roadmap/web_portal/chr_ state_learning.html. We excluded the "Quiescent/Low" cell state, as it encompasses a large proportion of the genome, resulting in unstable estimates of heritability.

Verily ATAC-seq data
We utilized immune cell ATAC-seq data generated from Verily as part of the SysteMS collaboration with Brigham & Women's Hospital (PIs: Dr. Chitnis and Dr. Weiner).
Cell sorting is done as follows: Frozen cryovials in liquid nitrogen were thawed in a 37°C bead bath and centrifuged for 5 min at 600×g, 4°C. The cell pellet was washed with 1 mL of FACS buffer, and the wash repeated. The cell pellet was resuspended in the residual volume with 2.5 μL of 0.33 mg/mL S7 DNAse. Fifty microliters of staining cocktail was added for the respective flow cytometry panels to be analyzed (T cell, B cell, myeloid panel) and incubated for 25 min on ice and in the dark. Cells were washed in FACS buffer, resuspended in a final volume of 400 μL FACS buffer, and passed through a 35-μm cell strainer cap. Stained samples were sorted on a FACSAria Fusion (BD Biosciences, San Jose, CA). Using FACSDiva v8.0.1 software, the samples were gated first by forward and side scatter properties, then FSC-H vs FSC-A for singlet discrimination, and finally, with their respective markers for each cell type (Additional file 1, Table S32). For each cell type of interest, up to 500 cells were sorted into the tagmentation buffer.

ATAC-seq library preparation and sequencing
Cells were sorted directly into 20 μL of cold tagmentation buffer (10 μL TD, 2 μL 2% IGEPAL CA-630, 6 μL nuclease-free H 2 O, 2 μL TDE1 per sample), followed by incubation at 37°C for 30 min with shaking at 500 RPM. Samples were stored at −20°C until further processing. DNA was extracted with the QIAGEN MinElute PCR purification kit according to the manufacturer's protocol, and samples were amplified with KAPA HiFi kits and Illumina Nextera indices. The amplified material was cleaned with the QIAGEN MinElute PCR purification kit and quantified using KAPA library quantification kits. Samples were normalized and pooled for sequencing on the NextSeq (Illumina).

Processing
Paired-end raw ATAC-seq reads were trimmed using NGmerge 60 using the default parameters. The reads were then aligned to GRCh38 using Bowtie2 version 2.3.5 49 . The resulting SAM files were converted and sorted into BAM format using samtools version 1.5 50 . We filtered out the reads with MAPQ<10 and reads that were aligned to mitochondria using samtools. In addition, duplicate reads were removed using picard version 2.20.6 (http://broadinstitute.org.github.io/picard). Finally, peaks were called using MACS2 version 2.2.5 51 with default parameters and --keep-dup all --nomodelnolambda. To obtain peaks for each cell type, we merged the peak files from all samples for that specific cell type using bedtools version 2.29 61 .

Enrichment of GWAS results within ATAC-seq peaks
To calculate enrichments of the MS GWAS data within annotations (e.g., ATAC-seq or ChIP-seq peaks), we applied stratified LD SCore regression (LDSC) [16,22]. LDSC was performed using LDSC v1.0.0 (https://github.com/bulik/LDSC), which was run on the discovery summary statistics from the MS GWAS discovery stage summary statistics [8]. The human MHC locus was excluded given its complex LD patterns as recommended by Finucane et al. [16] To run LDSC, we used precomputed LD scores based on the European ancestry samples of the 1000 Genomes Project Phase 1 [52] which was restricted to HapMap3 SNPs [53], and we generated partitioned LD scores for each set of annotations. To perform LDSC, we regressed the summary statistics (χ 2 ) from a given GWAS on to annotation-specific LD scores, with baseline scores (original 53 annotation model), regression weights and allele frequencies based on 1000 Genome Project Phase 1 data as precomputed by software authors. We applied partitioned heritability analyses using LDSC under three different models: (1) To ask how much a given annotation contributes to trait heritability, we used a LDSC model that includes baseline annotations and an annotation of interest. The heritability enrichment of the annotation was defined as the proportion of SNP heritability in the category divided by the proportion of SNPs in that category; we report statistical significance of this enrichment as p-values. (2) When comparing multiple annotations (e.g., ATAC-seq peaks from different cell types), we ran a LDSC model that includes the baseline model and annotations from all cell types. In this scenario, we calculate for each annotation the coefficient τ c which measures the contribution to SNP heritability for a given annotation to heritability in this overall model, stratified on other annotations in the model. Zscores for the coefficient τ c were converted to a one-sided p-value, which we report as a measure of statistical significance. (3) We also performed LDSC on pairs of annotations, which we term pairwise stratified LDSC. In these models, we include the baseline model, an index annotation of interest, and a comparator annotation of interest. To run pairwise stratified LDSC, we used a previously described extension of LDSC [54].
Throughout, all default LDSC parameters were used.

Statistical fine-mapping
Statistical fine-mapping was performed using the marginal p-values from the replication (joint analysis) summary statistics from the MS GWAS [8]. The 200 genome-wide significant loci at p<5×10 −8 were used. LD was calculated between each lead variant and all variants with r 2 >0.2 and within a 2-Mb window based on the 1000 Genomes Phase 1 (European subset) reference panel [52]. PLINK v1.90b3.32 was used to perform LD calculations [55,56] with parameters of "--r2 --ld-window-kb 2000 --ld-window 999999 --ld-window-r2 0.2." We then applied PICS to each locus [15]. Briefly, PICS uses the lead association pvalue and LD structure of the locus to calculate the most likely causal SNPs given the observed lead association signal. PICS probabilities represent the probability of a given SNP in a locus being the causal SNP. Default PICS parameters were used. From the PICS probabilities, we calculated 95% credible sets (CS). We defined the 95% CS as a set of variants such that the true causal variant has a 95% chance of being in the credible set. To calculate credible sets, for each locus, we ranked variants in descending order by their PICS probabilities. We then iteratively added variants to the credible set for that locus until the sum of their PICS probabilities was greater than or equal to 0.95. For CS inclusion, we also required the variant to have a PICS probability > 0.1.

Identification of target genes
We leveraged promoter capture Hi-C (PCHiC) data from 17 hematopoietic cell populations to link genetic associations with genes that they may regulate [36]. We filtered the PCHiC dataset for looping interactions with a CHiCAGO score > 5 [57]. An overlap between a GWAS variant and a PCHiC looping interaction was considered if the GWAS variant overlapped any position in the non-promoter ("other end") of the PCHiC interaction. For CD4 + T cells, we considered only GWAS SNPs that overlapped an ATAC-seq peak in bulk CD4 + T cells or any of the CD4 + T cell subsets (naïve effector CD4 + T cells, T h 1, T h 2, T h 17, follicular T h , naïve T regs , and memory T regs ), and which overlapped a PCHiC interaction in naïve CD4 + T cells (nCD4), total CD4 + T cells (tCD4), nonactivated total CD4 + T cells (naCD4), or activated total CD4 + T cells (aCD4). For B cells, we considered only GWAS SNPs that overlapped an ATAC-seq peak in bulk B cells or any of the B cell subsets (naïve B cells, memory B cells, or plasmablasts), and which overlapped a PCHiC interaction in naïve B cells (nB) or total B cells (tB).

Colocalization of MS GWAS loci with eQTL studies
We performed colocalization analyses utilizing coloc [37], MS GWAS summary statistics [8], and cis-eQTL analyses of CD4 T and B cells from the DICE database [38]. We included the following subsets: T h 1, T h 2, T h 1/17 (labeled as THSTAR in the provided summary statistics), T h 17, naïve CD4, stimulated CD4, naïve Tregs, memory Tregs, and naïve B cells. Sorting details are available in the DICE original publication [38]. For the MS GWAS, we included per locus all variants within a ±100 kb region. For the eQTL studies, we included variants with any p-value around ±1 megabase of the transcription starting site (TSS), as provided by the DICE authors, and any eGene with a significant eQTL association at 10% FDR. Coloc was run on the common set of variants and we defined a priori as colocalized a locus-eGene pair that had a posterior probability of both traits being associated and share a single causal variant (H4) larger than 60%. Per cell subtype OCR+loop enrichment was estimated with Fisher's exact for the 2×2 table of a locus-eGene being colocalized vs. overlapping OCR+loop within either CD4 T cells or B cells.

Correction for multiple hypothesis testing
Throughout our manuscript, we use Bonferroni corrections when testing multiple hypotheses. To generate a Bonferroni-corrected p-value threshold, we used a traditional p-value threshold of 0.05 divided by the number of tests being performed in a given analysis. We note that as many of the tests are correlated (since the underlying annotations are often highly correlated with each other), the effective number of independent tests being performed is fewer than the number of tests actually performed. As such, our analyses are overly conservative. We decided on this approach of using Bonferroni corrections as opposed to false discovery rate (FDR) approaches, as the number of tests being performed is often small, leading to unstable estimates of FDR.

Gene set enrichment analyses
We performed pathway analyses utilizing the canonical pathways (CP) of the Molecular Signatures Database (MSigDB v7.2), as it is available from the Gene Set Enrichment Analysis website (http://software.broadinstitute.org/gsea/msigdb). We ran the Canonical Pathways, Biocarta, KEGG, and Reactome gene set categories together in the same model. We estimated statistical significance using the hypergenometric distribution and applied false discovery correction, as previously described [8]. The same model was applied for the enrichment of prioritized gene sets with Gene Transcription Regulation Database (GTRD) transcription factor targets gene sets [40]. Significant enrichment level was set to a false discovery rate < 5%.

Protein-protein interaction networks
We utilized GeNets (https://apps.broadinstitute.org/genets) [39] to leverage known protein-protein interactions (PPI) of our prioritized gene sets. GeNets uses a random forest classified, trained in PPI data with 18 parameters that capture information about centrality and clustering. It creates communities of genes, sets of genes (nodes) that are connected to each other more than genes outside this community. Furthermore, it uses the random forest classifier and the connectivity to the tested gene set to propose candidate genes. For each described network, the p-value is estimated by testing whether the number of observed edges exceeds the numbers of possible edges using permutations. We ran GeNets via the web interface with the GeNets Metanetwork v1.0 and utilizing the InWeb model ("Override network the analysis model was trained on" option).

TEAD2 knockdown and over-expression in LINCS cell lines
Robust z-scores, "level 5 data," from knockdown (KD) or over-expression (OE) of TEAD2 in cancer cell lines [41], were downloaded from clue.io (https://clue.io/ command?q=/sig%20%22TEAD2%22). The robust z-scores represent differential expression for each genetic perturbagen, adjusted for the gene expression of all other perturbagens on the same physical plate. For knockdown and over-expression experiments, the differential expression comparator were samples using a vector control, which are negative genetic controls that either lack a gene-specific sequence or target a non-human gene (like GFP).