Systems-epigenomics inference of transcription factor activity implicates aryl-hydrocarbon-receptor inactivation as a key event in lung cancer development

Background Diverse molecular alterations associated with smoking in normal and precursor lung cancer cells have been reported, yet their role in lung cancer etiology remains unclear. A prominent example is hypomethylation of the aryl hydrocarbon-receptor repressor (AHRR) locus, which is observed in blood and squamous epithelial cells of smokers, but not in lung cancer. Results Using a novel systems-epigenomics algorithm, called SEPIRA, which leverages the power of a large RNA-sequencing expression compendium to infer regulatory activity from messenger RNA expression or DNA methylation (DNAm) profiles, we infer the landscape of binding activity of lung-specific transcription factors (TFs) in lung carcinogenesis. We show that lung-specific TFs become preferentially inactivated in lung cancer and precursor lung cancer lesions and further demonstrate that these results can be derived using only DNAm data. We identify subsets of TFs which become inactivated in precursor cells. Among these regulatory factors, we identify AHR, the aryl hydrocarbon-receptor which controls a healthy immune response in the lung epithelium and whose repressor, AHRR, has recently been implicated in smoking-mediated lung cancer. In addition, we identify FOXJ1, a TF which promotes growth of airway cilia and effective clearance of the lung airway epithelium from carcinogens. Conclusions We identify TFs, such as AHR, which become inactivated in the earliest stages of lung cancer and which, unlike AHRR hypomethylation, are also inactivated in lung cancer itself. The novel systems-epigenomics algorithm SEPIRA will be useful to the wider epigenome-wide association study community as a means of inferring regulatory activity. Electronic supplementary material The online version of this article (doi:10.1186/s13059-017-1366-0) contains supplementary material, which is available to authorized users.


SUPPLEMENTARY FIGURES
: ESTIMATE immune-cell scores in GTEX dataset. Boxplots of immune-cell infiltrate scores in the GTEX RNA-Seq dataset, stratified according to tissue-type, as obtained using the ESTIMATE algorithm [1]. Tissues have been ranked in decreasing order of immune-cell infiltration. Observed how lung is ranked 3 rd , after blood and spleen.

Fig.S2: Enrichment of ChIP-Seq binding sites among LungNet TF target genes.
Heatmap tabulates the enrichment P-values (as derived using a one-tailed Fisher's exact test) for direct binding targets derived from ChIP-Seq profiles (using window sizes of +/-1kb, 5kb and 10kb, as indicated) of the given TFs, among the corresponding inferred targets using LungNet. For each TF, all available ChIP-Seq profiles for that TF were integrated, irrespective of cell/sample type, as derived from the ChIP-Atlas resource (http://chip-atlas.org). In the table, NA indicates that ChIP-Seq data for that TF was not available, or that not sufficient binding targets were found. In red, we highlight the significant P-values of enrichment, after correction for multiple testing using Benjamini-Hochberg procedure. The P-values are near identical to those obtained using 10,000 Monte-Carlo randomizations whereby for each TF, an equal number of targets in LungNet were randomly selected from the full GTEX dataset. LungNet (x-axis), we plot the t-statistics of differential activity between lung and all other tissues (y-axis), as estimated in the GTEX dataset [2]. Red dashed line indicates the line P=0.05. Observe how all but one TF (NR2F2) pass the nominal significance threshold.  [3]. For each of the 38 TFs in LungNet we performed 1000 distinct randomizations, whereby its gene targets were randomized among all possible non-TF genes. For each randomization, we recomputed a t-statistic of differential activity between lung tissue (n=8) and all other tissue types (n=192). Boxplot compares the t-statistics of differential activity for the observed (ie unpermuted case, red) against the average over the 1000 randomizations (grey). P-values are from a Wilcoxon rank sum test: from left to right, the P-value testing that the observed tstatistics are larger than 0, the P-value between the observed (red) and randomized (grey) values, and the P-value for testing that the t-statistics from the randomized case are higher than 0. The density curves in the right panel compare the observed average value over the 38 TFs (red line) to the distribution of the average over the 1000 different randomizations. No randomization led to an average t-statistic larger than the observed one (P<0.001).  For each of the 38 TFs in LungNet we performed 1000 distinct randomizations, whereby its gene targets were randomized among all possible non-TF genes. For each randomization, we recomputed a t-statistic of differential activity between lung tissue (n=7) and all other tissue types (n=53) in the SCM2 Illumina 450k set. Boxplot compares the t-statistics of differential activity for the observed (ie unpermuted case, red) against the average over the 1000 randomizations (grey). P-values are from a Wilcoxon rank sum test: from left to right, the Pvalue testing that the observed t-statistics are larger than 0, and the P-value for testing that the t-statistics from the randomized case are higher than 0. The density curves in the right panel compare the observed average value over the 38 TFs (red line) to the distribution of the average over the 1000 different randomizations. No randomization led to an average t-statistic larger than the observed one (P<0.001).

Fig.S7: Monte-Carlo randomization analysis in the TCGA LSCC RNA-Seq set.
For each of the 38 TFs in LungNet we performed 1000 distinct randomizations, whereby its gene targets were randomized among all possible non-TF genes. For each randomization, we recomputed a t-statistic of differential activity between LSCC and normal-adjacent tissue from the TCGA LSCC RNA-Seq set [5]. Boxplot compares the t-statistics of differential activity for the observed (ie unpermuted case, red) against the average over the 1000 randomizations (grey). P-values are from a Wilcoxon rank sum test: from left to right, the P-value testing that the observed t-statistics are lower than 0, the P-value testing that the observed t-statistics are lower than the ones for the permuted case, and the P-value for testing that the t-statistics from the randomized case are lower than 0. The density curves in the right panel compare the observed average value over the 38 TFs (red line) to the distribution of the average over the 1000 different randomizations. No randomization led to an average t-statistic lower than the observed one (P<0.001).

Fig.S8: Preferential inactivation of lung-specific TFs in LSCC and LUAD. Left panels:
Boxplots of t-statistics of differential TF binding activity between cancer and normal-adjacent tissue for 6 different TCGA cancer types. TF binding activity was estimated using SEPIRA on the RNA-Seq data. P-values are from a one-tailed Wilcoxon rank sum test. Right panels: As left panel but now combining the non-lung cancer types together in one group, and P-values reflecting whether the t-statistics of differential activity are lower in the lung cancers compared to all other cancer types. Top row is for the case where SEPIRA was applied to the RNA-Seq data without prior z-score normalization. Bottom row for the case where z-score normalization was performed on the RNA-Seq data before applying SEPIRA.

D)
Scatterplot of t-statistics from a regression of TF-activity against disease stage (x-axis) against their significance level (-log10P, y-axis). E) Boxplots of the t-statistics of differential activity between each disease stage and normal lung. P-values are from a one-tailed Wilcoxon rank sum test, testing that the distribution of the differential activity values is less than 0. Fig.S10: Contrasting dynamics of differential expression of AHRR and AHR in lung cancer. A) Boxplot of mRNA expression of AHRR in normal lung tissue of ex-smokers and current smokers. P-value is from a one-tailed Wilcoxon rank sum test. B) Boxplots of mRNA expression of AHRR across all major histological stages of lung carcinogenesis (N=normal, H=hyperplasia, M=metaplasia, D=dysplasia, LCIS=lung carcinoma in-situ, ILC=invasive lung cancer). P-values are derived from a one-tailed Wilcoxon rank sum test comparing each stage to normal lung (N). C) Boxplots of mRNA expression of AHRR between normal-adjacent and lung squamous cell carcinoma samples of the TCGA. P-value is from a one-tailed Wilcoxon rank sum test. D-F) As A-C), but now for AHR, the TF which is a target of AHRR. TFEC  22797  33 33  0  TBX2  6909  18 14  4  FOXA2  3170  15 15  0  TAL1  6886  19 19  0  TBX4  9496  16