A compendium of SSVs and gene expression across 2334 cases
Our study brought together all available WGS and RNA-seq data for 2334 cases, of which 1482 cases had DNA methylation array data (450K platform) being uniformly generated and processed as part of TCGA consortium. RNA-seq data were previously processed as part of efforts by PCAWG consortium or by TCGA consortium or both, with batch correction being carried out here to harmonize the two sets of data into one unified set of 2334 sample profiles (Additional file 1: Figure S1A). Gene-level CNA values by WGS or SNP array were similarly harmonized together (Additional file 1: Figure S1B). WGS data were generated with either low depth of coverage (low pass, ~ 6–8×) or high coverage (high pass, ~ 30–60×), with both types of WGS being utilized effectively to identify SSVs in previous studies [3, 5]. We compiled SSV calls for 1232 cases from PCAWG [4], for 1207 low-pass WGS cases from our recent study [3], and for 764 high-pass WGS cases from TCGA [2] (Additional files 2 and 3). Cases sequenced with high coverage will have more SSVs detected on average and fewer false negatives [3]. Of the 2334 unique cases, 1033 had only low-pass WGS data available. Data integration (e.g., between SSVs and RNA-seq or between SSVs and DNA methylation arrays) was a key aspect of our study in identifying gene features with significance levels (whether by statistical modeling or permutation testing) rising above any noise inherent in one data platform, with statistical corrections being considered as warranted for technical covariates such as sequencing coverage or tumor sample purities.
Using a previously described integration approach between SSVs and gene expression [3, 13], we assessed gene-level associations between expression and nearby SSV breakpoints within several specified genomic region windows in relation to genes (upstream, downstream, or within the gene body, Fig. 1a). For each of the genomic regions relative to genes considered (e.g., within the gene, 0–20 kb upstream, 20–50 kb upstream, 50–100 kb upstream, 0–20 kb downstream, 20–50 kb downstream, and 50–100 kb downstream), we found widespread associations between SSV event and expression across the 2334 cases and 17,798 genes as expected, after correcting for expression patterns associated with tumor type or CNA (Fig. 1b, c and Additional file 4), indicative of SSV-mediated gene regulatory disruption. For each of the significant gene sets for a given genomic region window (using false discovery rate, or FDR, of < 5% [17]), many more genes were positively correlated with SSV event (i.e., expression was higher when SSV breakpoint was present) than were negatively correlated, though notably a substantial number of genes—including tumor suppressors such as PTEN, STK11, TP53, and RB1—were negatively correlated with SSV breakpoints occurring within the gene body, indicative of direct disruption of gene coding regions. In a few cases, within-gene SSV breakpoints associated with increased expression represented gene fusions, e.g., involving ERG, ALK, or RET (Fig. 1c and refs [3, 18]). Without statistical corrections for CNA, we found even larger numbers of genes with SSVs associated with increased expression (Fig. 1b), in line with previous observations of SSV breakpoints being strongly associated with copy gain [3]. The numbers of statistically significant genes were much lower when considering genomic region windows very far away from the gene, e.g., greater than 1 Mb.
The significances of association between expression and nearby SSV breakpoint, as made for all genes in the present study, were compared to that of both the previous study utilizing low-pass WGS [3] and the previous study of PCAWG high-pass WGS [5], involving 1448 and 1220 cases, respectively. Overall, we observed high concordances, between the results from the present study utilizing all 2334 cases and previous results from studies utilizing different subsets of the 2334 cases (Fig. 1d). Genes with known associations with cancer [14,15,16] that were significantly positively correlated with nearby SSV breakpoints in each of the three separate studies (FDR < 10% for any one of the following regions: 0–20 kb upstream, 20–50 kb upstream, 50–100 kb upstream, 0–20 kb downstream, 20–50 kb downstream, 50–100 kb downstream, or within the gene body, with corrections for cancer type and CNA) included AKT3, ALK, BCL9, CCND3, CDK4, ERBB2, ERG, IRF4, LRP1B, LZTR1, MDM2, MYC, PPARG, RET, TERT, and TP63; genes significantly negatively correlated included ARID1B, ARID2, GRLF1, PHF10, PTEN, RB1, SMAD4, STK11, TP53, and ZNF384. Here we found many genes that were not significant in one of the previous studies to be significant in the analysis of the combined datasets. This could be attributable in part to greater statistical power represented by the additional cases in the larger dataset, as well as to differences in the representation of cancer types between different studies. For example, SSVs associated with the upregulation of CD274 and PDCD1LG2 involved ~ 1% of non-amplified cases in the PCAWG cohort of 1220, but these significant patterns involved lymphomas and other cancer cases that were not represented in the other previous study of 1448 cases, and so these genes were not significant in that study.
An analytical approach to identify gene features consistently altered by nearby SSV breakpoints
In light of previous findings using the above genomic region windows method, we developed an alternative analytical approach to associate the expression of each gene with nearby SSV breakpoints. Given that genomic rearrangements may involve the translocation of enhancers, which may impact genes within a distance of ~ 1 Mb [3, 19], and given that examining relatively small regions on the order of ~ 20 kb may result in breakpoints falling just outside the window that would otherwise contribute to a significant pattern, we computed a “relative distance metric” for each sample and gene (Fig. 2a), which was the relative distance of the SSV breakpoint closest to the gene start site (upstream or downstream). A data matrix of absolute relative distances for all 17,998 genes and 2334 samples was assembled, with a relative distance metric of 1 Mb being applied for any sample with no breakpoints within 1 Mb of the gene. Using this breakpoint pattern matrix, the correlation between expression of each gene and the presence of nearby SSV breakpoints could be assessed, using linear regression models (on log-transformed expression and relative distance data) that would allow for incorporation of any relevant covariates. Among other things, the distance metric method provides a single result for each gene across the samples, representing genes consistently altered across the entire ± 1 Mb region examined, avoiding the issue of multiple testing of several adjacent regions.
By the distance metric method, we found hundreds of genes consistently altered in expression by SSV breakpoints occurring within a region ± 1 Mb of the given gene, after correcting for CNA (Fig. 2b and Additional file 1: Figure S2 and Additional file 5). Most genes surveyed (13,023 out of 17,798) showed a significant increase in gene copy number with nearby SSV breakpoints, but after corrections for CNA and cancer type, 626 genes showed altered expression with nearby breakpoints at a statistical cutoff of FDR < 5%, 521 of these genes being positively correlated with breakpoints and 105 genes being negatively correlated. We found more genes significant by the distance metric method than were significant for any single region examined by genomic region windows method (Fig. 1b). In addition to CNA, other possible covariates considered included tumor purity, tumor ploidy, low-pass versus high-pass WGS, total number of SSV breakpoints detected per sample, and patient age, none of which represented major confounders (Additional file 1: Figure S3A). In addition, permutation testing, whereby we randomly shuffled the SSV events (by shuffling the patient ids) and distance metric method carried out for each gene, again demonstrated far more significant differences over chance expected (Additional file 1: Figure S3B), consistent with the above FDR estimates by Storey and Tibshirani method [17]. Cases involving overexpression of a gene with nearby SSV spanned all cancer types examined (Additional file 1: Figure S3C). Gene fusion events accounted for a small minority of SSV-mediated gene overexpression (Additional file 1: Figure S3C), and cases with low-pass WGS as well as cases with high-pass WGS contributed substantially to the significant gene patterns observed (Additional file 1: Figure S3C). Significant SSV-gene associations involved all SSV classes and sizes (Additional file 1: Figure S3D and S3E). A set of 37 microRNAs also showed significant associations between nearby breakpoint and increased expression (Additional file 1: Figure S3F and S3G and Additional file 5), which is likely due in part to many microRNAs residing within host genes [20].
Significantly enriched gene categories (by Gene Ontology or GO, Fig. 2c and Additional file 5) within the set of 521 genes positively correlated with nearby SSV breakpoints (FDR < 5%, corrections for CNA and cancer type) included G-protein-coupled receptor activity (78 genes, p < 1E−20 by one-sided Fisher’s exact test) and signal transducer activity (94 genes, p < 1E−10); enriched gene categories within the set of 105 genes negatively correlated included BRAF-type complex (ARID1B, ARID2, PHF10, RB1), protein deubiquitination (BAP1, FAM188A, KEAP1, PTEN, RNF135, SMAD4, STAMBP, TP53, USP22), cell cycle arrest (APBB2, CDKN2A, CDKN2B, KILLIN, RB1, STK11, TP53), and proteolysis (18 genes). Overall, the results obtained by the distance metric method were consistent with those of genomic region windows method, although a number of cancer-associated genes were significant by the latter method but not the former method (Fig. 2d); while the distance metric method may aid in our obtaining a more focused set of genes impacted by breakpoints occurring across a larger genomic region, there were other genes impacted just within a smaller and very specific region, as uncovered by the genomic region windows method. A number of well-known oncogenes and tumor suppressor genes had expression impacted by SSV breakpoints in a sizable number of cases—on the order of 1 to 4% of the 2334—that did not harbor amplifications or deletions in the given gene (Fig. 2d), with oncogenes including TERT [13], CRKL [21], FGF4 [22], IGF2 [23], and long noncoding RNA (lncRNA) BCAR4 [24] (Fig. 2e and Additional file 1: Figure S3H). lncRNA MALAT1 was also positively correlated with SSV breakpoints (Additional file 1: Figure S3H), though likely not representing an oncogene itself but rather a correlate of aggressive cancers [25].
In contrast to the genomic region windows method, the distance metric method allows for inferring SSV-gene associations by individual cancer type, as this method incorporates SSV breakpoint information across a larger genomic region, where SSV events may be sparse. For each of the 23 major cancer types represented in our patient cohort, we assessed the gene-level associations between expression and the presence of nearby breakpoints. For most cancer types surveyed, on the order of hundreds of genes were significantly associated (FDR < 10%, correcting for cancer type and CNA) with SSV breakpoints (Fig. 3a and Additional file 6). As with the pan-cancer analysis, more genes were positively correlated with breakpoints than were negatively correlated. Analogous to findings from genomic surveys for significantly mutated genes [15], a large number of genes found significant in the analysis of individual cancer types did not reach significance when analyzing the combined pan-cancer set of 2334 cases (Fig. 3b). Out of 1280 genes significant for any one individual cancer type (FDR < 10%, correcting for cancer type and CNA), just 41 were significant in pan-cancer analysis (Additional file 1: Figure S4A). The set of SSV-associated genes for each cancer type was distinct from those of the other cancer types (Additional file 1: Figure S4A).
Interestingly, when we compared the cancer type-specific SSV-gene associations with differential DNA methylation patterns (involving n = 1482 cases out of the 2334), we observed highly significant overlaps, between the genes with positive correlations between expression and nearby SSV breakpoint for a given cancer type and the genes with high overall DNA methylation being associated with that same cancer type (Fig. 3c and Additional file 1: Figure S4B). We examined DNA methylation array probes for 11,203 CpG islands (CGIs), defining the top CGI probes for each cancer type having high methylation versus the rest of the cancers (FDR < 0.001, t test using logit-transformed data). Of the 20 cancer types with methylation data, nine showed a significant overlap (p ≤ 0.002, chi-squared test) between genes associated with the top differentially methylated features and genes positively correlated between expression and SSV breakpoints in cancer type-specific analyses. In all, 893 CGI DNA methylation probes—involving 193 genes—were involved in the significant patterns of overlap as described above (Fig. 3d and Additional file 5). One of the involved genes was TERT, for which several CGI probes were highly methylated in chromophobe renal cell carcinoma (chRCC) and for which associations between SSV breakpoints and TERT expression were significant specifically for that cancer type (Fig. 3d). The TERT-associated CGI probes with high methylation were located within the TERT gene boundaries and did not include the CGI probe known to represent a repressive regulatory element (probe cg02545192, Additional file 1: Figure S4C). Of note, the discovery of SSV-mediated deregulation of TERT in solid tumors was first made in the chRCC cancer type [13], though the associations involving DNA methylation had not previously been made.
Widespread impact of SSVs on methylation of specific CpG islands (CGIs) across 1482 cases
While previous studies have examined the global influence of SSVs on the expression of individual genes, an analogous survey of associations between SSVs and DNA methylation patterns remained to be carried out. The gene by sample breakpoint matrices as constructed above for analysis of gene expression (Figs. 1a and 2a) were joined to the DNA methylation data matrix of 1482 cases, in terms of the genes associated with CGIs. The correlation between methylation of each CGI and the presence of an SSV breakpoint in relation to the CGI-associated gene was assessed using linear regression models, with the inclusion of relevant covariates. We examined 111,203 CGI DNA methylation probes, involving 13,043 associated genes. After correcting for any associations that would be attributable to CNA [12], we found hundreds of CGI probes consistently altered in methylation by nearby SSV breakpoints, whether by examining smaller regions by genomic region windows method or by surveying the ± 1 Mb region surrounding each gene by the distance metric method (Fig. 4a and Additional file 7). More CGI features were positively correlated (i.e., showed increased methylation) with SSV breakpoints than were negatively correlated. By the distance metric method, 1286 significant CGI probes (FDR < 5%, Fig. 4a and Additional file 1: Figure S5) included 802 probes positively correlated with nearby breakpoint and 484 probes negatively correlated. By and large, a number of possible covariates considered did not represent major confounders of the methylation-SSV associations (Additional file 1: Figure S6A), although a number of CGI probes negatively correlated with nearby SSV appeared influenced in part by the total number of SSV breakpoints detected per sample and by sample-wide DNA methylation (Additional file 1: Figure S6A), which associations were further explored below. Along with statistical modeling, permutation testing by random shuffling of the SSV events also demonstrated far more significant differences over chance expected (Additional file 1: Figure S6B).
Strikingly, CGI probes with SSV-associated increased methylation were predominantly promoter-associated, while CGI probes with SSV-associated decreased methylation were enriched for gene body CGIs (Fig. 4b, c; Additional file 1: Figure S6C). Of the 802 probes positively correlated (FDR < 5%) with nearby breakpoint, 581 (72%) were promoter-associated according to ENCODE annotation of the 450K platform (Additional file 7), where ~ 377 probes (47%) would have been expected by chance (p < 1E−45, chi-square test). Of the 484 probes negatively correlated with nearby breakpoint, these were anti-enriched for promoter-associated probes (Fig. 4b), but included 321 probes located within the gene body (between the ATG and stop codon, by 450K platform annotation), as compared to ~ 142 expected by chance (p < 1E−70, chi-square test). Interestingly, while promoter methylation is known to silence genes, gene body methylation is often positively correlated with gene expression [26], though our significant CGIs would represent only a fraction of the total surveyed. Significant SSV-CGI associations involved all SSV classes and sizes (Additional file 1: Figure S6D and S6E).
We examined the overlap between CGI probes with SSV-associated altered methylation and the related genes with corresponding SSV-associated altered expression (Fig. 4d), taking particular note of inverse correlations between methylation and expression. Using FDR cutoffs of < 10% (distance metric method, with cancer type and CNA corrections), 1126 CGI methylation probes were positively correlated with nearby SSV breakpoints, of which 58 probes involved genes negatively correlated between expression and SSV breakpoints, a highly significant overlap (p < 1E−35, chi-squared test), with 49 of the 58 probes also showing inverse correlation between expression and methylation across the 1482 cases (FDR < 10%, Pearson’s correlation using log- or logit-transformed values, respectively, with corrections for cancer type and CNA), which genes included TP53 and PTEN (Fig. 4d and Additional file 1: Figure S6F). Out of 824 CGI methylation probes negatively correlated with nearby SSV breakpoints, 54 involved genes positively correlated between expression and SSV breakpoints (p < 1E−7, chi-squared test), which genes included DVL1 and FASN (Fig. 4e and Additional file 1: Figure S6F). Previous findings made elsewhere [5] of other types of DNA methylation patterns associated with genes positively correlated with nearby SSV breakpoints were also observable in the present study, namely an enrichment for genes positively correlated between expression and methylation (Additional file 1: Figure S6G), as well as an association of methylation at the TERT-associated cg02545192 site (which involves a repressor element) with increased TERT expression (Additional file 1: Figure S6H). For most individual cancer types, on the order of hundreds of CGI probes were significantly associated with altered methylation with nearby SSV breakpoints (Additional file 1: Figure S7A). Out of 16,096 probes significant by methylation analysis for any one individual cancer type (FDR < 10%, Additional file 1: Figure S7B), 143, representing 55 genes, showed an opposite and significant correlation between expression and SSVs for the corresponding gene within the same cancer type (Fig. 4f).
We explored potential mechanisms involving SSV-mediated alterations in gene expression and in DNA methylation, including disruption of topologically associated domains (TADs) and enhancer hijacking. Using published data on TAD coordinates in human cells [27], we categorized all SSVs in our pan-cancer dataset, by those that were TAD disrupting (i.e., the breakpoints span two different TADs) versus those that were non-disrupting (i.e., both breakpoints fell within the same TAD). As expected [3], for SSVs with breakpoints located in proximity to a gene and associated with its overexpression, a high enrichment (p < 1E−18, chi-square test) for TAD-disrupting SSVs was observed (Fig. 5a), though no similar enrichment pattern was observed for SSVs associated with gene underexpression. Interestingly, SSVs associated with higher CGI methylation or with lower CGI methylation were both highly enriched for TAD-disrupting SSVs (Fig. 5a and Additional file 8, p < 1E−18, chi-square test). We went on to examine potential enhancer hijacking events involving SSVs, focusing here on a set of active, in vivo-transcribed enhancers as cataloged previously [28]. For all SSV breakpoint associations occurring 0–500 kb upstream of a gene and with breakpoint mate on the distal side from the gene, we tabulated SSV breakpoint associations involving the translocation of an active, in vivo-transcribed enhancer within 0.5 Mb of the gene (assuming no other disruptions involving the region), where the unaltered gene had no enhancer within 1 Mb. In line with previous findings [3], the subset of SSV breakpoint associations involving gene overexpression showed a statistically significant percentage (8.1% versus 6.2%) involving putative enhancer translocation events (Fig. 5b and Additional file 9, p < 1E−6, chi-squared test). However, as might be expected, we observed no such enrichment patterns for SSVs associated with gene underexpression or with altered DNA methylation (Fig. 5b).
Interestingly, rearrangement of regions with higher or lower methylation from other parts of the genome was associated with SSV-associated DNA methylation alterations. Using the TCGA dataset of DNA methylation of 16 types of normal adjacent tissues and ~ 450K probes (Additional file 1: Figure S7C), for each SSV breakpoint, the average DNA methylation in normal tissues represented by the rearranged region (using window of 50 kb) was compared with the normal tissue methylation of the CGI nearby the gene on the other side of the breakpoint. This average difference in normal methylation represented by SSV breakpoint was computed for all SSV-CGI associations, as well as for the subset of SSV-CGI associations involving higher or lower DNA methylation for the CGI in the cancer sample (using FDR < 5% for the CGI by distance metric and > 0.4SD or < −4SD from sample median for the case harboring the breakpoint, with corrections for cancer type and CNA). On average, all SSV-CGI associations showed an increase in methylation represented by the rearranged region relative to the CGI on the other side of the breakpoint (average beta difference of + 0.37, Fig. 5c, Additional file 1: Figure S7C, S7D, and S7E), which reflects the fact that regions of the genome outside of CGIs tend to have higher DNA methylation (Additional file 1: Figure S7C). However, when considering the subset of SSV-CGI associations involving higher methylation in the cancer, the increase in methylation involving the rearranged region was even greater (average beta difference of + 0.47, Spearman’s p < 1E−6, Fig. 5c), and SSV-CGI associations involving lower methylation showed a highly significant decrease in methylation represented by the rearranged region (average beta difference of − 0.11, Spearman’s p < 1E−50, Fig. 5c). The top SSV-CGI associations involving the rearrangement of a region of low methylation (average methylation beta difference < − 0.1), with corresponding decrease in methylation and increase in expression being observed in the cancer case (< − 4SD and > 0.4SD from median, respectively), involved 314 unique cancer cases and 549 unique genes (Additional file 10), including genes such as TERT (14 cases) and FASN (5 cases, Fig. 5d).
Widespread molecular alterations associated with the overall burden of structural variation
As another line of investigation, we examined gene expression and DNA methylation features that were correlated with the total number of SSV breakpoints detected per sample, independent of where the breakpoints were located in relation to genes. The hypothesis explored here is that cancer cases with a high burden of structural variation may show an altered molecular profile as a result of the extensive DNA damage involved. As low-pass WGS on average would yield more false negative SSV events, low-pass versus high-pass WGS was incorporated as a covariate in the statistical modeling, along with the TCGA or ICGC project to factor in other technical differences between projects. We found thousands of molecular correlates at both the mRNA and DNA methylation levels (FDR < 10% by linear regression model), when factoring the above covariates as well as other potential biological or technical covariates (Fig. 6a, b and Additional file 11). Most covariates considered—which included CNA, proximal BP pattern (from Fig. 2a), patient age, tumor purity, and overall methylation—did not represent major confounders in inferring molecular correlations with the total number of SSV events detected across samples, with the one notable exception being overall methylation (the median beta across all 450K probes in the sample profile) as applied to the DNA methylation analyses, whereby thousands of CGI probes that had been significantly negatively correlated with the total number of SSVs lost significance when overall methylation was factored into the model, indicating that globally lower methylation levels overall could be associated with the total number of SSVs.
In addition to specific mRNA and DNA methylation features, other molecular variables were associated with the total number of SSVs across samples. A slight but significant decrease in overall DNA methylation (median beta of the 450K probes) with increasing numbers of SSV was observed, independent of high-pass or low-pass WGS (Fig. 6c, Additional file 1: Figure S8A-D), Spearman’s r = − 0.17, corrected p < 1E−5), and accounting for many of the individual CGI probes that were significantly negatively correlated with the total number of SSVs (Fig. 6b). Significantly enriched GO gene categories (Fig. 6d and Additional file 11) within the set of 2661 genes positively correlated with the total number of SSVs per sample (FDR < 1%, with corrections for cancer type, CNA, and low-pass versus high-pass WGS) included cell cycle process (340 genes, p < 1E−50 by one-sided Fisher’s exact test), chromosome organization (173 genes, p < 1E−45), cell division (145 genes, p < 1E−30), DNA repair (157 genes, p < 1E−25), double-strand break repair (63 genes, p < 1E−14), and methyltransferase complex (33 genes, p < 1E−7); enriched gene categories within the set of 1611 genes negatively correlated included immune system process (374 genes, p < 1E−50), immune response (214 genes, p < 1E−45), leukocyte activation (169 genes, p < 1E−25), and apoptotic signaling pathway (43 genes, p < 0.0001). Consistent with the observed overexpression of cell cycle genes, a trend of increased number of SSVs detected in tumor samples and overall patient survival was observed (Additional file 1: Figure S8E). Other tumor sample variables associated with the total number of SSVs (p < 0.0005, corrected Pearson’s) included tumor sample purity, tumor ploidy, tumor aneuploidy, overall CNA, and patient age (Fig. 6e and Additional file 1: Figure S8F and S8G). Genes associated with both low expression and high DNA methylation with increasing numbers of SSVs included ADRA1A, GATA3, TCF21, and SOX17 (Fig. 6e and Additional file 1: Figure S8H).
Based on the above GO term enrichment patterns (Fig. 6d), we examined gene transcription signatures of DNA damage response, of methylation, and of immune cell types across the 2334 cases with expression data (Fig. 7a). Gene signature analyses using a previously curated list of 276 genes encompassing all major DNA repair pathways [30] showed several of these to be elevated in cases with high numbers of SSVs, including signatures of base excision repair, mismatch repair, Fanconi anemia, and homologous recombination (Fig. 7a). The associations involving DNA double-strand break repair pathway and Fanconi anemia, in particular, were also evident when examining key individual genes, including BRCA1, BRCA2, FANCD2, FANCI, and RAD51 (Fig. 7b). Key genes involved in histone methylation and DNA methylation, which appeared significantly increased with increasing numbers of SSV events, included EZH2, WDR77, PRMT6, DNMT1, DMNT3A, and DMNT3B (Fig. 7c). Analysis of gene expression signatures from Bindea et al. [16, 31] suggested that levels of immune cell infiltrates (e.g., B cells, T cells, and dendritic cells) were higher within tumors harboring fewer SSVs, which was also evident from the analysis of canonical immune cell gene markers (Fig. 7a, d). A DNA methylation signature of leucocyte fraction [32] also showed a similar pattern of anti-correlation with increasing total numbers of SSVs (Additional file 1: Figure S8F, Pearson’s corrected p < 1E−8), indicating that some of the global DNA methylation patterns as well as the gene expression patterns identified may involve non-cancer cell types.