Non-coding RNAs underlie genetic predisposition to breast cancer

Background Genetic variants identified through genome-wide association studies (GWAS) are predominantly non-coding and typically attributed to altered regulatory elements such as enhancers and promoters. However, the contribution of non-coding RNAs to complex traits is not clear. Results Using targeted RNA sequencing, we systematically annotated multi-exonic non-coding RNA (mencRNA) genes transcribed from 1.5-Mb intervals surrounding 139 breast cancer GWAS signals and assessed their contribution to breast cancer risk. We identify more than 4000 mencRNA genes and show their expression distinguishes normal breast tissue from tumors and different breast cancer subtypes. Importantly, breast cancer risk variants, identified through genetic fine-mapping, are significantly enriched in mencRNA exons, but not the promoters or introns. eQTL analyses identify mencRNAs whose expression is associated with risk variants. Furthermore, chromatin interaction data identify hundreds of mencRNA promoters that loop to regions that contain breast cancer risk variants. Conclusions We have compiled the largest catalog of breast cancer-associated mencRNAs to date and provide evidence that modulation of mencRNAs by GWAS variants may provide an alternative mechanism underlying complex traits.


Introduction
The human genome is extensively transcribed. The majority of transcripts are long non-coding RNAs (lncRNAs), defined as > 200 base pairs in length and transcribed antisense, intronic or intergenic to proteincoding genes. RNA sequencing studies conducted on different tissues and cell types are continually identifying new lncRNAs, which indicates that comprehensive annotation of lncRNA genes is far from complete [1][2][3][4][5][6]. Recent studies, using targeted RNA sequencing (also called RNA CaptureSeq), have revealed tremendous complexity of human transcriptomes-predominantly driven by pervasive transcription, complex alternative splicing, cell type, and context-specific gene expression [7,8]. However, to date, only a small number of lncRNAs have an assigned function and the exact proportion of non-coding transcripts that are functional is a subject of continued debate.
Genome-wide association studies (GWAS) in combination with fine-mapping have identified 196 independent signals associated with breast cancer risk (conditional p values < 1 × 10 −6 ) [9]. Sixty-six signals confer a greater risk of estrogen receptor (ER)-positive tumors, 29 confer a greater risk of developing ERnegative tumors, and the remaining signals showed no statistically significant difference in their effects on ER subtypes [9]. Due to complex linkage disequilibrium (LD), genetic variants within a signal are frequently co-inherited making it difficult to pinpoint the variant driving the association. A recent study has defined the credible causal variants (CCVs) within each signal as those with p values within two orders of magnitude from the lead variant (7394 CCVs/196) [9]. At 28 signals, a single CCV was identified, and at 96 signals, the number of CCVs was ≤ 10. Similar to other traitassociated variants identified by GWAS, the majority of CCVs are located in non-coding genomic regions [1].
We have recently shown that CCVs are enriched in genomic features associated with regulatory activity in a range of breast-derived cell lines including sites of open chromatin, chromatin marks associated with promoter and enhancer activity (H3K4Me3, H3K4Me1, and H3K27Ac), and transcription factor binding sites [9,10]. Capture Hi-C analyses have identified annotated gene promoters that frequently interact with these CCV-containing elements providing a list of candidate risk genes for these signals [10]. However, at 20 signals, we found no evidence of CCVs falling in regions marked by regulatory activity suggesting that some signals alter risk through alternative mechanisms [10]. Using RNA CaptureSeq, we annotated multiexonic non-coding RNAs (mencRNAs) transcribed from genome intervals surrounding breast cancer risk signals in a range of mammary-derived tissue and cell lines. We provide evidence that non-coding RNAs represent an alternative mechanism by which CCVs alter breast cancer risk. This likely extends beyond breast cancer and may represent a common mechanism by which GWAS variants act in other complex traits and diseases.

RNA CaptureSeq array design and capture
To identify non-coding RNAs (ncRNAs) transcribed from breast cancer risk signals, we performed RNA CaptureSeq on 21 breast-derived samples including primary mammary epithelial cells from reduction mammoplasties, breast tumor samples, and normal mammary epithelial and breast cancer cell lines (Fig. 1a, b; Additional file 2: Tables S1 and S2). Oligonucleotide probes were designed to capture RNA transcripts produced from intronic and intergenic regions within 1.5-Mb intervals surrounding known breast cancer GWAS signals at the time of capture design (139/196 signals; Fig. 1a; Additional file 2: Table S3). Additional control sequences were added to the capture design to assess performance, including sequences corresponding to ERCC (External RNA Controls Consortium) spike-in transcripts and control housekeeping genes (Additional file 2: Table S4). In total, 138 Mb (4.3%) of the human genome was covered. RNA sequencing libraries were generated, hybridized against the capture probes, and sequenced. In addition, pre-captured RNA-seq libraries from four cell lines were sequenced to compare enrichment of the captured transcripts before and after hybridization. Based on the ERCC controls, the lower limit of detection (LLD), defined as the lowest molar amount of ERCC transcript detected in each library, was~300 times lower across the four captured libraries compared to non-captured controls (Additional file 1: Figure S1). Moreover, the captured libraries showed a high enrichment of all control housekeeping genes included in the capture and no enrichment of off-target controls when compared to the non-captured libraries (p = 3.5 × 10 −10 ; Additional file 1: Figure S2a).

Annotation of mencRNAs at regions flanking 139 breast cancer risk signals
To discover novel ncRNA genes, sequencing reads from the 21 captured libraries were de novo assembled, mapped back to the genome, and quantified (Additional file 2: Table S2). ncRNA genes overlapping annotated coding sequences were excluded. Lowly expressed (maximum FPKM across the samples < 0.5) and single-exon genes were also removed to avoid artifacts derived from genomic DNA and transcriptional noise [7]. We identified 4020 mencRNA genes (FPKM ≥ 0.5), 2766 of which are novel (defined as not overlapping with lncRNAs reported by GENCODE v.19 [11], FANTOM-CAT [12], or NONCODE [13]) (Additional file 2: Tables S5 and  S6). mencRNA transcript lengths ranged from 143 to 35,678 bp with a median length of 1550 bp (Fig. 1c). The Coding Potential Calculator [14] predicted a high proportion of the transcripts (94.03%) as "non-coding," significantly more than GENCODE lncRNAs (chisquare p = 2.5 × 10 −5 ; Additional file 2: Table S7). In silico assessment of the mencRNA transcripts showed the vast majority of splice junctions contained canonical dinucleotides (99.0%), were identified by uniquely mapping reads (81.3%), and were not overlapping with repeat sequences (96.1%) (Additional file 1: Figure S2b-d). Furthermore, similar to other studies [2], we show a high prevalence of two-exon lncRNA transcripts (Additional file 1: Figure S2e). On average, the mencRNA transcripts showed 5.5 times lower expression compared to GENCODE lncRNAs detected by our RNA CaptureSeq (n = 328; Additional file 1: Figure S2f). This is likely due to the targeted RNA sequencing approach, which enabled the identification of transcripts with lower expression levels. Hierarchical clustering of the mencRNA expression profiles clustered the samples according to their estrogen receptor status, whether they were derived from normal or tumor samples or from primary tissue or cell lines (Fig. 1d). Thirty mencRNAs containing CCVs were prioritized for RT-PCR validation which was then performed on the captured transcripts, binned into three groups of ten based on their level of expression. Ninety percent of transcripts with a FPKM ≥ 2 were validated by RT-PCR (9/10) compared to 80% of transcripts with 0.5 ≤ FPKM < 2 and 30% of transcripts with FPKM < 0.5 (Additional file 1: Figure S3a). As exposure to estrogen is known to promote the development of breast cancer, we treated MCF7 breast cancer cells with 17beta-estradiol to determine if expression of any mencRNAs was estrogen-regulated (Additional file 1: Figure S3b). Nearly one third of the mencRNAs (n = 1189) were differentially expressed between MCF7 estrogen-treated and vehicle-treated control libraries (FDR < 0.01; Additional file 1: Figure S3c; Additional file 2: Table S8). We then quantified the expression of our captured transcripts in 111 normal breast samples and 1092 breast tumors using standard RNA-seq datasets from TCGA [15]. Approximately 75% of the mencRNAs were present at an expression level of FPKM ≥ 1 even though no capture step had been performed prior to sequencing (Additional file 2: Table S9). Consistent with other non-coding RNAs, our mencR-NAs had on average 140-fold lower expression compared with GENCODE protein-coding genes (Fig. 1e). Principal component analyses (PCA) based on mencRNA expression distinguished normal breast tissue from matched breast tumor, PAM50 breast cancer subtypes, and ER-positive versus ER-negative breast tumors (Fig. 1f, g; Additional file 1: Figure S3d). We also analyzed the expression of the mencRNAs in six additional tumor types and show that the mencR-NAs exhibit high tissue specificity compared to protein-coding genes (Fig. 1h). PCA analyses based on mencRNA expression also tightly clustered tumors based on tissue type (Additional file 1: Figure S3e).

Enrichment of CCVs for breast cancer risk in mencRNA exons
Several studies have shown that most genetic variants identified through GWAS are non-coding and enriched in cell type-specific enhancers relevant to the GWAS trait [9,16]. However, the contribution of non-coding RNAs to complex traits is not clear. We assessed the mechanisms underlying CCVs and mencRNAs and showed enrichment of CCVs in the exons, but not the promoters or introns of the mencRNAs (Fig. 2a). Conversely, CCVs were depleted in the exons of proteincoding genes and enriched in the introns. In total, 119 mencRNA genes at 69/139 signals contained a CCV (417 in total; Additional file 2: Tables S10 and S11) within a mencRNA exon. One example is the 2q14.2 risk region, where CCVs in 3 of the 4 independent risk signals fell within mencRNA exons (Fig. 2b, c). Of note, CCVs within signals 2 and 3 fell in different exons of the same mencRNA, XLOC-130206 (Fig. 2c).
Given the enrichment of CCVs in mencRNA exons, we hypothesized that CCVs could affect mencRNAs by modulating RNA stability. We therefore performed an expression quantitative trait loci (eQTL) study using the breast tumor TCGA datasets to identify genetic variations associated with mencRNA expression. We identified 800 mencRNAs which were eQTLs (FDR < 0.05), nine of these eQTLs overlapped with breast cancer signals, based on the p value for the top eQTL SNP being within two orders of magnitude of the eQTL p value for a CCV. We further examined the colocalization of the eQTL and breast cancer signals and show that seven of the nine eQTLs colocalize (Additional file 1: Figure S4; Additional file 2: Tables S12 and S13). Of these, three signals have at least one eQTL variant (eVariant) within a mencRNA exon (Additional file 2: Tables S12 and S13).  Figure S5a; Additional file 2: Tables S12 and S13). Notably, eQTL studies in multiple normal and breast tumor cohorts did not find an association between CCVs at this signal and any annotated protein-coding genes (at p < 5 × 10 −4 ), suggesting that XLOC-142280 is a likely target gene [17]. Using TCGA RNA-seq data, we show that XLOC-142280 is predominantly expressed in ER-positive breast cancers (Fig. 3d). Given that CCVs at this region are exclusively linked with ER-positive breast cancer, this association could be explained by the restricted breast cancer subtype expression of XLOC-142280 [9].

Evidence for distal CCVs modulating mencRNAs
For 4/7 of the eQTLs that colocalized with breast cancer risk signals, there were no eVariants in mencRNA exons. Therefore, we speculated that CCVs may also act distally through regulatory elements that interact with mencRNA promoters. Using Capture Hi-C data from breast cells [10], we identified 770 mencRNA promoters (defined as ± 500 bp from the transcription start site) that looped to a region containing a CCV (Additional file 2: Tables S11 and S14). For example, at 16q12.2, CCV rs11642015 is an eVariant (p < 5 × 10 −4 ) for the mencRNA XLOC-093918, with the risk haplotype associated with increased XLOC-093918 levels ( Fig. 4a; Additional file 1: Figure S5b; Additional file 2: Tables S12 and S13). CCV rs11642015 falls within a region of open chromatin marked by H3K27ac and H3K4me1, consistent with a putative enhancer element and contacts XLOC-093918 in B80T5 normal breast cells and ER-positive and ER-negative breast cancer cell lines (Fig. 4b, c; Additional file 1: Figure S5c). The CCV also contacts the IRX5/CRNDE bidirectional promoter; however, no eQTL for IRX5 or CRNDE or any other gene was detected in either normal or tumor tissues (at p < 5 × 10 −4 ) [17].
In another example, we show that promoters of three mencRNAs (XLOC-214919, XLOC-222497, and XLOC-222554) located at the 6q25/ESR1 risk region participate in chromatin looping with regions containing CCVs (Fig. 5a). Interestingly, we also observed looping between the ESR1 promoter and both XLOC-222497 and XLOC-222554 (Fig. 5a) and the expression of both mencRNAs was highly correlated with ESR1 (Fig. 5b), suggesting that these genes are co-regulated or are themselves regulatory mencRNAs that mediate cis-regulation of ESR1. Given the high correlation between the mencRNA-mRNA pairs at 6q25, which are connected by chromatin loops, we investigated whether mencRNA-mRNA pairs connected by loops across our captured regions were more highly correlated than mencRNA-mRNA pairs not connected. Although the effect size is marginal, we found that the expression of mencRNA-mRNA pairs that are physically connected through chromatin looping are significantly more correlated than a random set of mencRNA-mRNA pairs (Fig. 5c).
Evidence for mencRNAs being the target gene of more than one signal Recent fine-scale mapping of breast cancer GWAS risk regions have reported multiple independent risk signals, some of which target the same protein-coding gene [9,10]. We therefore sought to identify mencR-NAs that are targeted by more than one risk signal. We identified 222 mencRNA genes in which two or more independent CCVs fell either (i) within the exon of the mencRNA, (ii) within the promoter region of the mencRNA, or (iii) in a region that interacts through long-range chromatin interactions with the promoter of the mencRNA (Additional file 2: Table S14). For example, at 18q11.2, we identified an eQTL variant for XLOC-112072 (FDR = 1.95 × 10 −10 ) that overlaps signal 3 CCVs (Fig. 6a-c; Additional file 1: Figure S5d), two of which fall in the mencRNA exon. In addition, signals 1 and 2 loop to the promoter of XLOC-112072 in T47D breast cancer cells (Fig. 6b). Interestingly, only signal 1 interacts with the XLOC-112072 promoter in B80T5 normal breast cells, suggesting some level of cell-type specificity (Fig. 6b). Notably, no other protein-coding eQTLs were identified for signals 1, 2, or 3 (p < 5 × 10 −4 ), supporting a role for this mencRNA as one of the likely target genes at this breast cancer risk signal.

Discussion
To date, the major focus of GWAS follow-up studies has been the impact of regulatory variants on the expression of protein-coding genes. In many cases, genetic variants with stronger genetic associations with the trait of interest are dismissed as causal because they do not fall in regions marked for promoter or enhancer activity. Using a combination of targeted RNA sequencing and de novo transcript assembly, we annotated > 4000 mencRNA genes expressed from genomic intervals surrounding 139 breast cancer GWAS signals. We identify 844 mencRNAs as candidate breast cancer risk genes based on CCVs that fall in either (1) mencRNA exons, (2) mencRNA promoters, or (3) regions that interact with the mencRNA promoters through longrange chromatin interactions. We summarize the evidence for mencRNA involvement at breast cancer risk signals (Additional file 2: Table S11), which will facilitate future lab-based functional studies. Collectively, these results suggest that modulation of mencRNAs may represent an alternative mechanism by which CCVs contribute to risk.
Our eQTL analyses identified nine mencRNA eQTLs which colocalize with breast cancer risk signals. Of these, five were the only detectable eQTL for the signal providing evidence that these mencRNAs are at least one of the likely candidate target genes at the respective signal. Three of the seven mencRNAs had an eVariant within the mencRNA exon. However, there were other genetic variants in LD which were more distal (CCVs in the same signal), making it difficult to determine if the exonic eVariant is affecting the mencRNA stability and driving the association. For example, the CCVs within the same signal that lie outside of mencRNA exons could potentially effect mencRNA expression through altered cis-regulatory elements (e.g., mencRNA promoters or enhancers). Consistent with this, for 4/7 mencRNA eQTLs that overlap breast cancer risk signals, the eVariants fall outside the mencRNA exons suggesting that, in these cases, altered distal regulation is the more likely mechanism. We have already demonstrated this regulatory mechanism at the 11q13 breast cancer risk region [18]. CCVs at 11q13 fall within an estrogen-regulated enhancer of two lncRNAs called CUPID1 and CUPID2.
In heterozygous breast cancer cell lines, we showed allele-specific chromatin looping between the enhancer and the CUPID1/2 bidirectional promoter suggesting CCVs reduce their expression by inhibiting chromatin looping [18]. CUPID1/2 play a role in modulating pathway choice for the repair of double-stranded DNA breaks by promoting homologous recombination-based repair [18], providing a plausible mechanism by which CCVs alter breast cancer risk. In another example, prostate cancer risk variants at 8q24 increase the activity of an enhancer for the PCAT1 lncRNA. PCAT1 interacts with the androgen receptor and lysine-specific demethylase to promote prostate cell growth providing a mechanism by which genetic variants increase risk of prostate cancer [19].
We show that CCVs are enriched in the exons but not the introns or promoters of mencRNAs, suggesting that genetic variants may alter mencRNA structure and/or function. lncRNAs can act as protein scaffolds; therefore it is possible that a genetic variant could alter the binding of proteins to mencRNAs. For example, LINP1 promotes the repair of DNA breaks by acting as a scaffold for proteins involved in non-homologous end joining (NHEJ) [20]. The 2q12.1 region associated with celiac disease provides an example of genetic variants affecting lncRNA function at a GWAS locus. This study showed that risk-associated variants repress the expression of inflammatory genes by altering the secondary structure of a lncRNA called Lnc13 and ultimately the binding of hnRNPD to Lnc13 [21]. At some breast cancer risk signals where no eQTL association was found, we speculate that CCVs may be functional without affecting gene expression. This is certainly the case for somatic mutations in two lncRNAs, MALAT1 and NEAT1, where levels of these lncRNAs are not significantly altered in mutated vs non-mutated samples [22].
CCVs are also enriched in genomic features associated with cis-regulatory DNA elements including sites of open chromatin, chromatin marks associated with promoter and enhancer activity, and transcription factor binding sites [9,10]. Given that lncRNAs often arise from enhancer elements [23][24][25], it is possible that some CCV-containing enhancers fall within mencRNAs. Notably, we did not observe enrichment of CCVs in the introns or promoter regions of mencRNAs, suggesting that alteration of the lncRNAs transcribed from enhancers may have a functional effect on the enhancer activity. Several lncRNAs are required for promoting chromatin looping between enhancers and promoters [26][27][28]. Therefore, it is possible that genetic variants within the lncRNA could affect this process. Further work will be required to determine whether the CCVs that overlap mencRNAs and enhancers in this study act through the DNA element, for example, by altering transcription factor binding, or whether the variant affects the RNA transcript itself.

Conclusions
In summary, we have compiled the largest catalog of breast cancer-associated mencRNAs to date and provide evidence that modulation of mencRNAs by GWAS variants may provide an alternative mechanism underlying complex traits. These findings have broad implications for interpreting findings from GWAS and suggest that comprehensive annotation of ncRNAs expressed in relevant cell types around GWAS signals will be important for functional follow-up studies. We do not yet know what proportion of mencRNAs identified in this study are functional. However, even if a small percentage have evolved functions, this resource may contain hundreds of new genes that hold the key to understanding breast cancer etiology. lncRNAs display remarkable cell and tissue-specific expression making them excellent candidates for therapeutic targets. Understanding the function of these lncRNAs therefore holds great potential for the development of new breast cancer therapies.

Human samples
Normal breast samples were derived from reduction mammoplasty from four donors, and organoids were obtained according to Johnson et al. [37].

RNA CaptureSeq array design
Oligonucleotide probes were designed to capture noncoding sequences within 1.5-Mb intervals surrounding breast cancer GWAS signals using Roche NimbleGen's capture design algorithm. To evaluate performance of the RNA CaptureSeq, control sequences were added to the design including 92 ERCC spike-in transcripts and 9 housekeeping genes covering a range of expression levels across breast tissues (ABCB1, GATA1, GUSB, HMBS, HPRT1, NLK, RUNX2, TBP, and TFRC; Additional file 2: Table S4). In total, 85.8% of the target bases were captured providing a total capture size of 137.7 Mb (4.3% of the human genome). The vast majority (~92%) of capture probes were unique (matched to only one genomic sequence). To increase the coverage of the remaining targets, a small proportion had multi-sequence homology.

RNA CaptureSeq library preparation and capture sequencing
RNA was extracted from the primary mammary epithelial cells and normal mammary epithelial and breast cancer cell lines using the RNeasy Plus Mini Kit (Qiagen). Five micrograms of total RNA was rRNA depleted using the Ribo-Zero Gold rRNA removal kit according to the manufacturers' instructions (Illumina). RNA was extracted from the breast tumor samples using TRIzol (Thermo Fisher Scientific), then DNase-treated with Turbo DNase (Life Technologies). ERCC RNA spike-in control mix 1 or 2 (Life Technologies) were added to ribodepleted RNA (final dilution of 1/50) or to 200 ng of total RNA (from breast tumors; final dilution of 1/1250). RNA-seq libraries were generated using the KAPA stranded RNA-seq Library Preparation Kit (Roche) and 12 cycles of pre-capture LM-PCR. Multiplex library pools were created by mixing equal amounts of four pre-capture libraries and capture hybridization performed on 1 μg of the pooled library. Capture hybridizations were performed using the SeqCap RNA Enrichment System (NimbleGen) according to the manufacturer's instructions. Post-capture LM-PCR was performed for 14 cycles. One library pool (representing four original libraries) was sequenced per lane on the Illumina HiSeq2500 platform (Kinghorn Centre for Clinical Genomics, Sydney, Australia).

Assessment of RNA CaptureSeq performance
To evaluate fold enrichment, sensitivity, and specificity of the RNA CaptureSeq, we added control sequences to the capture design including ERCC spike-in transcripts and housekeeping genes. Before hybridization of the libraries to the probes, we made technical replicates from four pre-capture libraries (B80-hTERT1, Hs578T, MCF7, and MDA-MB-231) and sequenced.
Sequencing data from these non-captured and the corresponding captured libraries were used to evaluate the RNA CaptureSeq performance. Adapter trimming and quality control of the sequencing reads were performed with Trim Galore version 0.3.7. The reads were then mapped against hg19 reference genome appended with 92 ERCC spike-in transcripts using STAR version 2.4.2a. RSEM version 1.2.25 was used for transcript quantification. Expression data for the ERCC spike-in transcripts were used to assess sensitivity and fold enrichment of the RNA CaptureSeq. A dose-response plot for each library was generated by plotting normalized expression (log2 (FPKM)) of the ERCC transcripts against their known molar concentration added by a linear regression line of best-fit. To assess the sensitivity of the platform, we obtained a lower limit of detection (LLD) from each plot (defined as the lowest concentration of the ERCC transcripts present in each library that yields a detectable expression with the detection threshold FPKM of 0.5) and compared between captured and non-captured libraries. Upper limits of detection were also determined to assess probe saturation. Fold enrichment of the platform was calculated using linear regression equations of the captured and non-captured libraries. We evaluated off-target capture as a measure of specificity of the RNA CaptureSeq. Seven housekeeping genes included in the design and expressed in the samples (GUSB, HMBS, HPRT1, NLK, RUNX2, TBP, TFRC) were tagged as "targeted" and, for each, the nearest protein-coding gene not included in the design (VKORC1L1, H2AFX, PHF6, TMEM97, SUPT3H, PSMB1, and ZDHHC19) was selected and tagged as "non-targeted." For each gene, expression fold change between the captured and non-captured libraries was measured and log-transformed values were compared between the targeted and non-targeted genes using a paired Student's t test.
De novo assembly and transcript characterization As described previously [7], for each library, the sequencing reads were trimmed with Trim Galore version 0.3.7, assembled with Trinity version 2.2.0 [29], and mapped back to the hg19 reference genome with GMAP version 2015-11-20 [30]. The transcripts from all libraries were then merged together using the Cuffmerge function of Cufflinks version 2.2.1 [31]. The sequencing reads were then mapped against the merged assembly using STAR version 2.4.2a [32] and quantified using RSEM version 1.2.25 [33]. Single-exon genes, those not overlapping breast cancer GWAS regions, and those overlapping protein-coding sequences were removed. Moreover, we excluded lowly expressed genes (maximum FPKM across all libraries < 0.5) and isoforms with very low expression (maximum FPKM across all libraries < 0.01). The identified mencRNAs were assessed for coding potential by Coding Potential Calculator [14]. Splice junctions were annotated and characterized using STAR version 2.4.2a. We clustered the samples based on expression profiles of the mencRNAs using the "hclust" function from the R "Stats" package. EdgeR Bioconductor package [34] was used to identify mencRNAs differentially expressed between MCF7 estrogen-treated and vehicle-treated control libraries.

Annotation of the captured transcripts in TCGA cohort
The captured transcripts were filtered for multi-exon genes and those overlapping with GWAS regions, then merged with the GENCODE comprehensive gene annotation (release 19) to generate a custom reference annotation file. In the case of overlapping genes, if the RNA CaptureSeq gene was lowly expressed (maximum FPKM across all libraries < 0.5), it would be removed and otherwise the GENCODE gene would be excluded. TCGA RNA-seq data for breast invasive carcinoma cohort (1226 samples), cervical squamous cell carcinoma (n = 309), esophageal carcinoma (n = 165), mesothelioma (n = 86), prostate adenocarcinoma (n = 552), skin cutaneous melanoma (n = 463), and uterine corpus endometrial carcinoma (n = 175) cohorts was obtained from Cancer Genomics Hub. Adaptor trimming of the sequencing reads was performed using Cutadapt version 1.11. Using the custom reference annotation file, the reads were then mapped with STAR version 2.5.2a and quantified with RSEM version 1.2.30 [32,33]. Since TCGA RNA-seq data is not strand-specific, we excluded the captured transcripts which overlap (≥ 1 bp) with exons of protein-coding genes on the opposite strand in order to prevent bias when assessing expression levels. For differential expression analysis, genes with low expression (less than five counts per million in less than five samples) were excluded. The remaining expression data were imported into edgeR Bioconductor package [34] and normalized for library size and RNA composition. Differential expression analysis was then performed using edgeR functions glmFit and glmLRT. We used the R function prcomp for principal component analysis (PCA). For co-expression analysis, genes with low expression (expression counts in less than 20% of samples) or low variability (median absolute deviation (MAD) across the samples < 1) were excluded. Pairwise correlations were then computed using the R cor function. Tissue specificity index (Tau) [38] was measured using log2-transformed expression data for primary tumor samples from the seven TCGA cancer cohorts.

RT-PCR validation of the captured lncRNAs
We used RT-PCR to validate 30 captured lncRNAs, binned into three groups based on their level of detection in T47D breast cancer cells. mencRNAs expressed at FPKM > 2 were classified as high expressed, 0.5 ≤ FPKM < 2 as moderate expressed, and FPKM < 0.5 as low expressed. Five micrograms of total RNA was first reverse transcribed using SuperScript III (Life Technologies), and PCR was performed using MyTaq DNA polymerase according to the manufacturers' instructions. Forty cycles were performed in order to identify low expressed transcripts. Forward primers and reverse primers were designed to different mencRNA exons to avoid amplification of genomic DNA. Primers are listed in Additional file 2: Table S15.

CCV enrichment analyses
We first examined the overlap of CCVs and mencRNAs genes. Variants were intersected with mencRNA exons, introns, and promoter sequences (defined as 500 bp upstream of the transcription start site). Fold enrichment was calculated by dividing the proportion of CCVs that overlapped the annotations by the proportion of coverage of that annotation within the captured regions. The significance of the overlap was calculated by comparing it to the expected using a hypergeometric test using Bedtools fisher. To further analyze CCV overlap, positions of each annotation were permuted around circularized capture regions. This approach maintained the relative sizes and positions of annotations belonging to the same transcript. Overlap of CCVs was compared to the mean overlap for 10 5 random permutations.

mencRNA eQTL analyses
RNA-seq reads from TCGA breast invasive carcinoma cohort were mapped against mencRNAs and quantified as described above. For these analyses, we did not filter the mencRNAs based on RNA CaptureSeq expression. The expression data for tumors (n = 1112) were then analyzed for eQTL detection. Patient germline SNP genotypes (Affymetrix 6.0 arrays) were processed and imputed to the 1000 Genomes reference panel (October 2014) as previously described [17]. Tumor tissue copy number was estimated from the Affymetrix 6.0 arrays and called using the GISTIC2 algorithm [35]. Complete genotype, RNA-seq, and copy number data were available for 678 genetically European patients. The FPKM values were log2 transformed and quantile normalized. Genetic variants within the captured regions were extracted from the imputed genotype dataset. The associations between genetic variants and mencRNA expression in breast tumor tissue were evaluated using linear regression models by the MatrixEQTL program in R [36]. Tumor tissue was also adjusted for copy number variation, as previously described [39]. A false discovery rate of 5% was used to report eQTL results from breast tissue. We considered eQTLs to overlap breast cancer risk signals based on the p value for the top eQTL SNP being within two orders of magnitude of the eQTL p value for a CCV. Colocalization analysis of eQTL and breast cancer risk signals was performed as described by Liu et al. [40].
ATAC-seq and Capture Hi-C ATAC-seq and Capture Hi-C in breast cells were obtained from Beesley et al. [10].
Additional file 1: Figure S1. RNA CaptureSeq performance. Figure S2. Quality control and properties of captured transcripts. Figure S3. Validation of captured transcripts. Figure S4. Co-localisation of eQTL associations at breast cancer GWAS signals. Figure S5. eQTL analysis and chromatin interactions at breast cancer risk regions.
Additional file 3. Review history.