MBASED: allele-specific expression detection in cancer tissues and cell lines
© Mayba et al.;licensee BioMed Central 2014
Received: 31 March 2014
Accepted: 7 August 2014
Published: 7 August 2014
Skip to main content
© Mayba et al.;licensee BioMed Central 2014
Received: 31 March 2014
Accepted: 7 August 2014
Published: 7 August 2014
Allele-specific gene expression, ASE, is an important aspect of gene regulation. We developed a novel method MBASED, meta-analysis based allele-specific expression detection for ASE detection using RNA-seq data that aggregates information across multiple single nucleotide variation loci to obtain a gene-level measure of ASE, even when prior phasing information is unavailable. MBASED is capable of one-sample and two-sample analyses and performs well in simulations. We applied MBASED to a panel of cancer cell lines and paired tumor-normal tissue samples, and observed extensive ASE in cancer, but not normal, samples, mainly driven by genomic copy number alterations.
Transcriptional activity at the different alleles of a gene in a non-haploid genome can differ considerably. Both genetic and epigenetic determinants govern this allele-specific expression (ASE)  and impairment of this highly regulated process can lead to disease . To understand the biological role of ASE and its underlying mechanisms, a comprehensive identification of ASE events is required. Recent advances in sequencing technology enable investigation of entire genomes at increasingly fine resolution. Whole exome DNA sequencing (WES) or whole genome DNA sequencing (WGS) allows identification of single nucleotide mutations or polymorphisms in all exonic regions or the entire human genome, respectively, while messenger RNA sequencing (RNA-Seq) enables quantitative analysis of gene expression. The expression state of the heterozygous loci detected in WES or WGS assays can be investigated in a matched RNA-Seq sample from the same individual, leading to a detailed map of the ASE activity. This approach allows the investigator to uncover the instances of complete or near allele silencing, which would be impossible using only RNA-Seq data.
Next-generation sequencing of short reads is prone to technical biases, for example, over- or under-representation of certain sequence motifs or inhomogeneous mapping, which must be overcome for effective ASE detection [3-5]. In addition, data from multiple heterozygous single nucleotide variants (SNVs) in the same gene must be integrated, and the large number of tested genes requires appropriate statistical treatment of the multiplicity of tested hypotheses. Despite these obstacles, next-generation sequencing technology has been recently used to identify putative sites of ASE within and between samples [4,6-14]. Previous work using short reads to detect ASE focused either on model organisms [11,13] or on normal human tissues or cell lines [4,10,12], although limited studies have explored the ASE landscape in cancer [15,16]. Further, there is currently no standard and robust way to aggregate information across SNVs into a single measure of ASE for an entire transcript isoform or gene. Most published studies either tested ASE at the SNV-level, sometimes requiring agreement across SNVs within a gene [3,6,7,10,12,17,18], or used available phasing information to sum reads across SNVs . A recent study  incorporated phased SNV-level information into a gene-level statistical model, allowing for extra variability due to alternative splicing effects on allelic ratios at individual SNVs. However, with the exception of limited samples such as those from the HapMap Project , most specimens do not have SNV phasing information. In some cases, population genetics-based approaches and existing databases can be used to phase common single nucleotide polymorphisms (SNPs) . However, the ability to phase common SNPs into individual haplotypes, whether based on previous knowledge or a statistical method, does not apply to somatic mutations in cancer. This makes it challenging to assign the ASE status to the mutant allele and reduces the ability to study the ASE of mutation-carrying genes.
To overcome these difficulties, we developed a novel ASE detection method, called MBASED. MBASED assesses ASE by combining information across individual heterozygous SNVs within a gene without requiring a priori knowledge of haplotype phasing; therefore, it can be applied to a wide array of existing RNA-Seq data sets, most of which do not have phasing information available. When phasing information is present, MBASED takes advantage of it to increase the power of ASE detection. In practice, even with modest sequencing depths, a large number of genes show more than one detectable heterozygous exonic SNV in RNA-Seq data, highlighting the importance of having a framework for aggregating expression information across individual loci.
To robustly estimate gene-level ASE from SNV-level RNA-Seq read counts, MBASED employs a meta-analytic approach , used originally to combine information from several studies into a global effect estimate. Our approach can be used in both one-sample and two-sample analyses, making MBASED a versatile tool for investigating allele-specific expression, both within an individual sample and in the context of differential ASE.
We applied MBASED to a panel of human lung cancer cell lines and paired tumor-normal lung and liver tissue samples. None of our samples had haplotype phasing information available, exemplifying a typical situation in gene expression studies. Our goal was to investigate the landscape of ASE in cancer and to identify potential instances of ASE contributing to cancer phenotypes. Previous studies of ASE in cancer were limited by sample size  (three paired tumor-normal samples) or concentrated on detecting monoallelic expression in the context of loss of heterozygosity events . In this study we present a general view of ASE, monoallelic or otherwise, in a panel of 25 cancer samples across 2 tissue types, including direct tumor/normal comparisons. We observed high rates of ASE (9 to 26%) in tumor tissue samples relative to normal tissue samples (0.5 to 2%), as well as variable ASE rates in cancer cell lines (1 to 31%). We found the observed elevated ASE rates in cancer samples to be mainly driven by underlying changes in genomic copy number and allelic composition. Numerous instances of genes with recurrent ASE in cancer were attributed to recurrent genomic alterations involving known cancer genes, for example, TP53 and KRAS. We found a number of mutations with known or suspected roles in cancer, including L858R mutation in EGFR and several G12A and G12C mutations in KRAS, to exhibit overexpression of the mutant allele, highlighting potential ASE contribution to cancer phenotypes. Joint analysis of tumors and matched normal samples did not reveal any instances of loss of imprinting, although several instances of loss of ASE in tumor were observed, including a switch of the overexpressed allele in the mono-allelically expressed pro-apoptotic factor BCL2L10. Our comprehensive analysis revealed a rich landscape of ASE in cancer and highlighted the flexibility and usefulness of our proposed method MBASED for ASE detection.
First, we give an overview of our method, MBASED, with detailed descriptions provided in Materials and methods and in Supplementary methods in Additional file 1. Given RNA read counts supporting reference and alternative alleles at individual SNVs within a unit of expression, MBASED provides an estimate of ASE and a corresponding P-value. A unit of expression can be a gene, a transcript isoform, an exon, or an individual SNV: MBASED is agnostic with respect to the nature of the unit provided by the user. In this work, we choose the gene as a unit of ASE, which we define as the union of all exons forming individual transcript isoforms.
In two-sample ASE analysis, the goal is to detect differential allelic imbalance between paired samples from the same individual. MBASED treats this problem in an asymmetric way, by designating one of the two samples as the sample of interest, for example, tumor sample in a tumor versus normal comparison. If true haplotypes are unknown, then for any gene that exhibits tumor-specific ASE, only the tumor read counts are informative for separating haplotypes into ‘major’ and ‘minor’. In such cases, MBASED phases alleles at individual SNVs into two haplotypes based exclusively on the sample of interest (tumor, in this case). If normal-specific ASE is under study, for example, when investigating loss of imprinting, then the normal sample can be designated as the sample of interest. Differences between ‘major’ allele frequencies at individual SNVs in the two samples are used as measures of between-sample ASE. SNV-level scores are combined into a gene-level score using meta-analysis, analogous to the one-sample approach. This composite score provides an estimate of gene-level MAF difference between the samples. As in our one-sample approach, internal simulations are used to assign statistical significance to the observed allelic imbalance, in cases of uknown true haplotypes.
We adopt the approach of DerSimonian and Laird  in establishing a meta-analysis framework for combining information across SNVs. This approach views the true unobserved treatment effect (in our case, ASE) at each observational unit of a common phenomenon (SNVs of a gene) as a random variable with a common mean. The estimate of that mean is obtained by combining information across individual units and represents a measure of the global effect (gene-level ASE). Within this framework, MBASED also reports the P-value corresponding to the constancy of the treatment effect statistic, Q, for multi-SNV genes in both one-sample and two-sample analyses. Q measures the observed extent of inter-SNV variability of ASE within a single gene (heterogeneity). The small reported P-values indicate genes with individual SNVs showing significantly inconsistent estimates of ASE. Such patterns can arise due to differences in ASE between various transcript isoforms of a gene , and therefore MBASED provides metrics for assessing the extent of isoform-specificity of the observed gene-level ASE.
Situations where one allele is favored in the read count data, even in absence of underlying ASE, have been reported in RNA-Seq data, due, for instance, to enrichment protocols, technological artifacts or a choice of a short read aligner [3,4]. We refer to such cases as instances of pre-existing allelic bias. When supplied with the values of probabilities of observing each allele at individual SNVs under conditions of no ASE, MBASED can incorporate such pre-existing biases into its estimates of ASE (Supplementary methods in Additional file 1), and we further provide functionality to estimate these probabilities from the data set itself. Our algorithm is implemented in the R  package MBASED. Further details are found in Materials and methods, Supplementary methods in Additional file 1, and the package vignette.
To demonstrate the performance of MBASED in the absence of phasing information, we analyzed multiple sets of simulated data, in which artificially introduced allele-specific expression patterns were assigned to different genes at various allele preferences and expression levels. We selected a pair of matched tumor-normal samples from our panel (HCC individual 2) and recorded all of the detected exonic heterozygous SNVs in both samples, retaining information about the total RNA coverage of each SNV, while discarding the observed reference and alternative allele counts in each sample. We chose a real data set as the basis for our simulations to ensure that the simulated data sets had realistic distributions of both the number of heterozygous SNVs per gene and the read coverage per SNV.
We found that the true positive rate (TPR) increased with read coverage and underlying ASE strength (MAF), as well as with the number of SNVs in a gene. We controlled the overall false discovery rate (FDR) at the nominal level of 5%, indicating that the P-value adjustment was effective. MBASED performed well even in low information settings. For example, >90% of ASE TP genes with 2 SNVs and 20 to 30 reads/SNV were recovered in simulations with MAF = 0.8. In analyzing real data, we required that a gene exhibit an estimated MAF ≥0.7 in addition to passing the statistical significance cutoff in order to be declared as exhibiting ASE (Materials and methods). As expected, this additional effect size cutoff reduced the TPR drastically for underlying ASE strength MAF = 0.7 (overall TPR fell from 55% to 37%), but had no appreciable effect on the TPR for higher values of MAF (data not shown).
MBASED employs beta-binomial distribution to model read count data (Materials and methods), which accounts for extra-binomial variability (overdispersion) often observed in allelic counts in RNA-Seq data sets [4,13,14]. We used the levels of overdispersion similar to those observed in real data (Materials and methods) while performing simulations, and note that MBASED performance improves as the amount of overdispersion decreases and the separation between signal and noise distributions of test statistic increases (Figures S1 to S4 in Additional file 1).
We also tested the performance of MBASED in the setting of pre-existing allelic bias, by assuming that at each SNV under conditions of no ASE the probability of observing reference allele count, P ref , was >0.5 (global reference bias). We found the results to be very close to those observed in the no-bias simulations (Figures S5 and S6 in Additional file 1). We conclude that the MBASED method is robust in detecting ASE genes in samples with unknown true haplotypes, with detection power increasing with observed gene coverage and the number of detected heterozygous SNVs in a gene.
We further assessed the performance of MBASED in a situation where the true underlying haplotypes are known. We obtained previously published lymphoblastoid cell line RNA-Seq data and a list of phased genomic variants for (non-cancer) individual NA12878, genotyped together with both parents as part of the 1000 Genomes Project . We pre-processed the data analogously to other samples in our panel (Materials and methods) and applied MBASED both with (‘phased’) and without (‘non-phased’) specifying the true haplotypes. Overall, we tested 2,560 genes for ASE, including 1,104 (40%) with >1 heterozygous loci. Using the cutoffs of 0.7 on estimated MAF and 0.05 on adjusted P-value, MBASED found 110 genes to show ASE in the ‘phased’ setting and 115 genes in the ‘non-phased’ setting, of which 108 were in common, indicating a high degree of consistency (Figure S7 in Additional file 1; Additional file 2). The small number of observed discrepancies was due to higher power of MBASED to detect ASE in general, and isoform-specific ASE in particular, when true haplotypes are known. A detailed discussion of the observed differences between running MBASED with and without prior knowledge of true haplotypes is provided in the Supplementary discussion in Additional file 1. We further note that running MBASED without supplying the true haplotypes resulted in correct haplotype reconstruction of 40/47 (85%) ASE genes with multiple SNVs. Further investigation revealed that of seven instances where haplotype reconstruction failed, six were likely due to alignment artifacts (Supplementary discussion in Additional file 1).
Finally, we compared the performance of MBASED with that of Skelly et al. , which is to our knowledge the only currently published ASE detection method that allows for variable ASE within a gene. Since the method of Skelly et al. requires that the true haplotypes be known, we used NA12878 RNA-Seq data for this comparison and supplied the true haplotypes to MBASED (Materials and methods). The method of Skelly et al. identified 103 ASE genes (posterior P(ASE) >0.95, posterior median MAF ≥0.7), compared to 110 identified by MBASED, including 94 that were common to both methods (Additional file 2). Of the nine genes identified as ASE by the method of Skelly et al. only, all have estimated MBASED MAF ≥0.8, but fall short of the significance cutoff due to low read coverage (10 to 12 reads/SNV, MBASED ASE P-values 0.05 to 0.17). Of the 16 genes identified as ASE by MBASED only, 15 show posterior P(ASE) >0.95 according to the method of Skelly et al., with posterior median MAF values of 0.58 to 0.7. The lower MAF estimates of Skelly et al. are due to its no-ASE prior imposed on the data. A detailed discussion of the observed differences between MBASED and the method of Skelly et al. is provided in the Supplementary discussion in Additional file 1. We conclude that the two methods produce qualitatively and quantitatively similar results on this data set. We note, however, that MBASED can perform in situations when the true haplotypes are unknown, a major advantage over the method of Skelly et al. In addition, MBASED allows for the effects of pre-existing allelic bias and disambiguates the technical and biological contributions to overdispersion in the data (Materials and methods), while the method of Skelly et al. combines the two.
We applied the MBASED method to a panel consisting of 18 lung cancer cell lines, 3 non-small cell lung cancer (NSCLC) tumor tissue samples, 4 hepatocellular carcinoma (HCC) tumor tissue samples, and 7 matched normal samples for the tumor tissues (Table S1 in Additional file 3) for a total of 25 cancer (21 lung and 4 liver) and 7 normal samples. None of the samples in the panel had known haplotypes. One-sample MBASED analysis was performed for each of the 32 samples and two-sample analysis was performed for tumor/normal and normal/tumor comparisons of 7 paired samples. Within each sample (or a pair of samples for two-sample analysis) only the genes containing informative heterozygous SNVs were tested for ASE (Materials and methods). Any gene with a BH adjusted MBASED P-value ≤0.05 and estimated MAF ≥0.7 was declared as exhibiting ASE in one-sample analysis. Similarly, any gene with a BH adjusted MBASED P-value ≤0.05 and estimated MAF difference ≥0.2 was declared as exhibiting sample-of-interest-specific ASE in two-sample analysis. This assignment provided one way of determining a set of genes in which to further characterize ASE in downstream analysis. All genes with adjusted inter-SNV ASE variability P-value ≤0.05 were flagged as possibly subject to isoform-specific ASE effects. Further details of the analysis pipeline are provided in Materials and methods, and the full results of MBASED application to the samples in our panel are available in Additional files 4 and 5. We note that the power of MBASED to detect mild levels of ASE is limited in the low coverage setting (right panels of Figures 2 and 3), common in our data, and the ASE levels reported here likely underestimate the true extent of ASE in samples under study.
Of genes that exhibited ASE in the one-sample analysis of tumors and that also were tested for ASE in the two-sample analysis, 48 to 77% showed tumor-specific ASE (Figure 4C). By comparison, a much smaller fraction of genes showing ASE in one-sample analysis of normal samples were found to show normal-specific ASE (3 to 32%), despite higher RNA-Seq coverage of the normal sample in five out of seven sample pairs (Table S1 in Additional file 3). This indicates that while most of ASE observed in normal samples is retained in the tumor, a large fraction of the ASE observed in the tumors has developed during the tumorigenesis process.
Across our 32 samples, we found that in one-sample MBASED analysis 22 out of 2,080 ASE genes with multiple heterozygous SNVs showed evidence of isoform-specific ASE. We note that the significance test based on heterogeneity statistic Q has lower power in the settings of low read coverage and few SNVs, common in our data, and we likely underestimated the extent of isoform-specific ASE. Since 20 of these genes were found in the liver samples (7 in the normal, 13 in the tumor), there might be more isoform-specific ASE occurring in the liver, although none of these genes exhibited liver-specific expression. Alternatively, it is possible that we were hindered in our detection of isoform-specific ASE by the low sequencing depth, since liver samples had the highest RNA-Seq coverage in our data set. In the two-sample MBASED analysis, 16 out of 701 ASE genes with multiple heterozygous SNVs showed evidence of isoform-specific ASE, including 12 in the liver samples (11 in the normal, 1 in the tumor). The biological significance of the observed instances of isoform-specific ASE is unclear and is further complicated by the observation that 10 out of 22 genes with one-sample isoform-specific ASE and 5 out of 16 genes with two-sample isoform-specific ASE were represented by only one transcript isoform. This observation may be due to incompleteness of the current set of gene models or to the variance of SNV-level measures of ASE in those genes exceeding what is allowed by our statistical model.
Overall, the normal samples exhibited limited extent of ASE, using our chosen cutoffs, while the cancer samples showed much higher ASE rates, with isoform-specific ASE playing a limited role, if any.
Based on these and similar observations, we conclude that the instances of recurrent ASE in our cancer samples were often driven by recurrent modifications of the underlying genomic CN state, affecting known driver genes in some cases.
Some instances of overexpressed functional mutant alleles
P -value (adj.)
G12C (NSCLC ind. 1)
Reported in COSMIC, known activating mutations
Falls short of significance cutoffs
Q787K (NSCLC ind. 2)
Both mutations in same sample, on same haplotype. L858R is reported in COSMIC
L858R (NSCLC ind. 2)
Q59H (HCC ind. 1)
Mutations in this motif results in hypersensitivity to mutagens
S207F (HCC ind. 1)
Tumor suppressor (inducer of apoptosis via p53)
Mutated domain is a binding site for MTOR inhibitor
K1248N (NSCLC ind. 2)
Outside of CN gain/loss or AI regions
R426C (HCC ind. 1)
R136H (HCC ind. 2)
T319I (HCC ind. 4)
We observed five instances of functional mutations that alter codon 12 of KRAS, a known oncogenesis-driving event . In three out of five cases, the mutant allele was significantly over-represented, while in another instance (NSCLC cell line H2009) the over-representation was borderline significant (MAF = 0.66, BH adjusted P-value = 0.1). This suggests the selective pressure to produce a large number of constitutively activated forms of KRAS. We also observed the over-representation of known activating mutation L858R in the kinase domain of EGFR (as well as a novel mutation in the same domain in the same individual), and a mutation in the FAT domain of mTOR, a major regulator of cell-signaling pathways. The FAT domain is a binding site for the mTOR inhibitor DEPTOR , suggesting that this mutation might also be constitutively activating. The potential instances of overexpressed inactivating mutant alleles include a mutation in the transactivation domain of the tumor suppressor EAF2 , and a mutation in the ring-finger motif of gene RAD18, which is involved in post-replication DNA damage repair .
Out of 41 instances of functional mutations with the mutant allele in the major haplotype of ASE-exhibiting genes, five fell outside of regions of genomic copy number change and/or allelic imbalance (Table 1, last 5 rows), including mutations in cancer-related genes MYH9, TIMP1, and FAS. However, it is unclear what the exact consequences of these mutations were for protein functionality, and what advantage to tumorigenesis, if any, was conferred by overexpression of the mutant allele.
The overexpression of mutant alleles that might confer some advantage to tumor cells was not universal. We found a small number of examples of functional mutant alleles that were expected to contribute to tumor phenotype but were not overexpressed. For example, cell line H441 contained a mutation in codon 12 of KRAS, but unlike the other four instances of this mutation (Table 1), this mutant allele was under-represented relative to the wild-type allele. We also found two instances of known activating mutations in residue 61 of NRAS , with no evidence of ASE in one case and strong evidence for overexpression of the wild-type allele in the other.
In summary, we found multiple examples of ASE where the overexpressed allele was either a known or suspected activating mutation in an oncogene or an inactivating mutation in a tumor suppressor gene. In almost all such cases the observed ASE arose from underlying DNA CN alterations. We observed some instances where the mutations expected to contribute to the cancer phenotype were underexpressed. It is possible that in such cases the mutation was crucial to early oncogenic processes, but that at later stages of tumor evolution the dependence of the cells on the mutant form of the protein was reduced.
We observed 89 instances of monoallelic expression (ASE with estimated MAF >0.9) in 74 genes across 7 normal samples in our panel. In 36 of these instances (corresponding to 28 genes) the matched tumor sample contained at least one common informative SNV in that gene, enabling us to test these genes for tumor-normal allelic imbalance using the two-sample MBASED approach. We found that in 5 out of 36 tested cases (13.9%), the observed monoallelic ASE (MAE) was specific to a normal sample and was lost in a tumor (Figure S10A in Additional file 1), but there were no instances of recurrent loss of MAE in tumor tissue samples. One example was gene ABP1 in HCC individual 4 (Figure S10B in Additional file 1). A previously described translocation event adjacent to ABP1 in this sample  might be a contributing factor to the observed loss of MAE. We also found one instance of an MAE pattern reversed between normal and tumor, in the BCL2L10 gene in HCC individual 4 (Figure S10C in Additional file 1). BCL2L10 encodes a pro-apoptotic factor and has been implicated in 5-azacytidine resistance in acute myeoloid leukemia and myelodysplastic syndrome patients . The two alleles differ by a pair of SNVs in a 3’ UTR, but it is not clear if the observed switch of ASE pattern was due to differential functional efficiency of the two alleles or if one of the alleles was more oncogenic.
Extending this analysis to all instances of ASE in normal samples, including non-monoallelic, we found 161 cases of ASE in normal samples that could be tested for tumor-normal allelic imbalance. In 30 (18.6%) cases, the observed ASE was normal-specific, with 16 (53.3%) such instances not attributable to underlying CN alterations, including four out of five loss-of-MAE cases. The extended analysis also did not reveal any genes with recurrent normal-specific ASE.
Loss of imprinting has been previously reported in cancer . We cross-referenced a list of known imprinted genes  against the list of genes with loss of ASE in tumor samples, and found no instances of loss of imprinting (Figure S10A in Additional file 1). In general, we found that out of 55 known imprinted genes, only 7 could be tested for ASE in 3 or more normal samples. We found that two of these seven genes (FAM50B and NDN) showed monoallelic expression in all tested instances, while the other five genes (GNAS, IGF2, NAA60, SLC22A18, and SLC22A3) did not show any evidence of ASE. These observed patterns could be due to the previously reported tissue-specificity of imprinting .
In addition to imprinting, another known source of ASE is chromosome X inactivation in female cells. We found that all but one of our nine female samples showed much higher rates of ASE on chromosome X than in the rest of the genome (Figure S11 in Additional file 1; Fisher exact test P-value <0.02 for chromosome X versus autosomal ASE rate comparison for all eight samples). The sole exception was a female cell line, H2009, that suffered a loss of a copy of chromosome X and exhibited no ASE on that chromosome. We found that the rates of ASE in chromosome X genes in the two normal female tissue samples were low (<8%), consistent with the existence of several clonal lines in each sample, with different copies of the chromosome inactivated in different clones . On the other hand, all female cancer samples (after excluding H2009) showed high rates of ASE on chromosome X (54 to 100% of tested genes). In some cases, including both female tumor tissue samples, this elevated rate could be attributed to underlying CN alterations. However, two of the cell lines did not show any CN changes or AI on chromosome X (data not shown), suggesting that a monoclonal expansion took place in these samples, giving rise to cell mixture, where one copy of chromosome X was preferentially silenced .
Overall, we observed a moderate extent of loss of normal ASE in tumors, with approximately 15% of normal ASE genes being normal-specific. We did not find any instances of recurrent loss of ASE and we also did not detect any instances of loss of imprinting. The observed loss of ASE did not appear to be driven by the underlying CN alterations, although the exact mechanism and biological significance of this process remain unclear. On the other hand, we observed elevated rates of ASE on chromosome X in cancer samples, occasionally accompanied by underlying genomic allelic imbalance. These latter cases might be due to the previously described high extent of chromosome X inactivation following monoclonal expansion. Our analysis was limited by a small sample size and low sequencing coverage. Larger-scale studies are needed to investigate these issues further.
We developed a novel method, MBASED, for the detection of allele-specific gene expression, both in a single-sample analysis setting and in the context of two-sample comparison (differential ASE). MBASED leverages all available information to determine the extent of ASE in a given gene by combining evidence across SNVs within a gene using a meta-analysis-based approach. In our study, a high fraction of genes showed evidence of more than one heterozygous expressed SNV, highlighting the importance of having an information aggregation framework. For the eight liver tissue samples we have examined, which had the highest levels of both WGS and RNA-Seq coverage among our samples, 45% to 55% of the genes were multi-SNV, and we expect these higher percentages to be typical of all deeply sequenced samples.
A main advantage of MBASED is that it does not rely on known phasing information and is therefore capable of using both SNPs and mutations in the analysis. While the haplotype reconstruction approach employed by MBASED is far from robust, the use of internal simulations allows the assignment of proper statistical significance to the resulting estimates of ASE. Using a sample with known haplotypes as a control, we find that running MBASED without supplying the haplotype information leads to correct haplotype recovery for 40/41 multi-SNV ASE genes. Further, out of 115 genes declared to exhibit ASE when haplotypes are withheld, only 7 are not supported when haplotype information is taken into account. Of these, six genes either show ASE just below our significance cutoffs or are cases of spurious ASE likely due to alignment artifacts. These observations indicate that lack of knowledge of true haplotypes leads to a very minor increase in type 1 error rate.
The framework of MBASED supports both within-sample and between-sample ASE analyses. The latter functionality allows the user to, for example, identify differential ASE in tumor/normal comparisons, or to detect instances of ASE not attributable to DNA copy number changes in RNA versus DNA allelic comparisons. The meta-analytic approach taken by MBASED also allows the user to detect instances of isoform-specific ASE. Since MBASED is agnostic with respect to the unit of expression, future studies might look at measuring ASE of individual transcripts directly. Finally, the algorithm is capable of handling pre-existing allelic biases (for example, global reference bias due to enrichment protocol or alignment technique that favors the reference allele), without sacrificing performance.
A potential limitation of MBASED is its assumption of exactly two haplotypes of a gene, but there might be rare situations of a chromosomal duplication of a variant-containing gene followed by a further mutation, giving rise to three distinct haplotypes. In this case MBASED will then attempt to resolve SNV-level information into two haplotypes. Another limitation is the reliance of MBASED on various approximations when incorporating pre-existing allelic bias and overdispersion into the model (Supplementary methods in Additional file 1), but this might not be suitable for extreme values of allelic bias or overdispersion levels. Finally, unlike the one-sample MBASED approach, the two-sample MBASED algorithm does not utilize a variance-stabilizing transformation prior to employing meta-analytic data-combining procedure (Supplementary methods in Additional file 1), exposing the overall estimate of ASE to potential influence of outliers. Further work is needed to properly address these limitations.
Applying MBASED to a combined panel of cell lines and paired tissue-normal samples, we found a large extent of ASE in cancer samples, driven primarily by underlying changes in DNA CN or composition (AI). As the result, the observed instances of recurrent ASE could often be attributed to recurrent genomic alterations.
The ability of MBASED to include somatic mutations in the ASE analysis allowed us to look in depth at the expression patterns of mutant alleles. We discovered evidence for significant preferential expression of the activating allele in known oncogenes (for example, KRAS and EGFR). In the case of KRAS, we observed significant mutant allele overexpression in three out of five mutated samples, with another sample showing borderline significance, suggesting that the overexpression might be the result of positive selection in tumor evolution. Our analysis shows that in almost every instance (36 out of 41), the observed significant overexpression of a mutant allele that was predicted to have functional consequences in the protein product was due to a CN change or an allelic imbalance event in the underlying genomic regions. This suggests a limited role for alternative ASE mechanisms (for example, pre- or post-transcriptional suppression of transcripts carrying wild-type alleles) as drivers of overexpression events giving rise to cancer phenotype. Further work is needed to clarify whether the findings reported here are a general feature of the cancer landscape, and what role ASE plays in a variety of cancer tissue types and indications.
A number of technical biases may give rise to false positives when the ASE state of transcriptomes is assayed with RNA-Seq [3,5,8]. In an initial analysis we discovered a large number of genes that showed recurrent ASE across multiple samples, including some genes with monoallelic ASE in 20 or more samples. We investigated those genes in more detail and discovered that we could attribute ASE recurrence to various artifacts, which we subsequently eliminated from the data (Materials and methods; Supplementary methods in Additional file 1). In some instances, we believe that the observed recurrent ASE might be due to the errors on the part of the aligner (for example, if there exists a known highly homologous region in the genome). However, in other cases we found evidence that the detected heterozygous variants in the gene were due to the presence in the sample genome of a homologous nonexpressed region that was absent from the reference genome. Since most of those variants are reported in dbSNP (v.132), an investigator might be led to believe that such a gene is imprinted or shows monoallelic expression in a cis-determined fashion. Thus, care needs to be taken in order to prevent the detection of spurious instances of ASE, which are likely to dominate any list of recurrent ASE events.
In addition to presenting a new method for ASE detection in both one-sample and two-sample analyses, the current study presents, to the best of our knowledge, the most comprehensive look at ASE patterns in cancer to date. As more samples across different cancer types become available, a comprehensive picture of the extent and role of ASE in oncological diseases will emerge. Simultaneously, the patterns and role of ASE and imprinting in normal tissues will be elucidated and will in turn shed light on why these would be disrupted in cancer. However, until the sequencing technologies mature to the point of allowing the investigator to obtain uninterrupted sequences of entire transcripts, ASE calling methods will continue to rely on aggregating information across several loci. Therefore, the many advantages presented by the MBASED algorithm make it well-suited for the current stage of ASE studies of both normal and tumor samples.
We performed WGS on 18 NSCLC cell lines and paired tumor-normal samples from 3 NSCLC patients, as well as 4 paired tumor-normal samples from 4 HCC patients, using Complete Genomics technologies , for a total of 7 normal and 25 cancer samples. All 32 samples have been previously published [31,37]. We performed RNA-Seq (75 or 76 bp paired-end) using Illumina GA-IIx for all samples, and the resulting short reads were aligned to the hg19 version of human genome using the GSNAP algorithm . All duplicate reads have been reduced to a single copy in order to avoid the detection of spurious ASE due to biases in PCR amplification steps of sequencing protocol. CN and AI information on the 18 cell lines was assayed with Illumina OMNI 2.5 M SNP array and processed with a modified version of PICNIC algorithm . This analysis produced integer estimates of total and major allele CN in the segmented genome. Any region with total CN >2 was declared to be a region of CN gain, while any region with total CN <2 was declared to be a region of CN loss. Similarly, any region with (Major allele CN)/(Total CN) > 0.5 was declared to be in the state of AI. In addition, CN status and AI information for the seven tumor tissue samples relative to the paired normal samples was inferred from WGS data using a dedicated pipeline, as previously described [31,37]. Detailed information about the samples is provided in Table S1 in Additional file 3.
For each sample, we obtained a list of called SNVs from the output of the Complete Genomics processing pipeline and tabulated the reference and alternative allele counts at each SNV from the aligned DNA and RNA reads. We eliminated potential homozygous SNVs by requiring that both the reference and alternative allele be supported by at least five WGS reads each as well as by at least 10% of all WGS reads aligned to that SNV. To avoid spurious SNV calls due to nearby indels , we also eliminated any SNVs falling within 10 bp of another variant. SNVs were assigned to RefSeq genes and only exonic SNVs were retained, with any SNVs falling into exons of more than one gene discarded. We further required that SNVs be covered by at least 10 reads in RNA-Seq data to ensure sufficient power to detect ASE and any excessive inter-loci variability. If multiple SNVs were overlapped by common WGS or RNA-Seq reads, they were merged into a single locus to ensure that the observed read counts at individual SNVs (loci) of a gene were independent, as required by the statistical model.
A number of authors have reported the existence of false positive ASE calls produced by various biases in short read aligners [3,4]. In order to filter out potential alignment artifacts we adopted some additional filters based on Self Chain alignments of the genome to itself [39-41], the reported frequency of structural variants in genomic regions provided in the Database of Genomic Variation [42,43], and on frequency of detected variants within a gene (see Supplementary methods in Additional file 1 for in-depth discussion of the filtering pipeline). The selfChain alignments and DGV variants were downloaded from the UCSC Genome Browser database  on 29 November 2012.
For tissue samples, somatic mutations in cancer samples were identified based on WGS using the CallDiff tool provided by Complete Genomics. For cell lines, a filter based on a large database of known common variants was applied to SNVs and any SNV passing the filter was declared ‘putative somatic’, as previously described .
Variant consequences were obtained for each SNV and each affected transcript with Variant Effect Predictor . Effect predictors SIFT  and PolyPhen  were used to identify deleterious variants among the SNVs. Any variant predicted to be deleterious by either SIFT or PolyPhen or predicted to result in stop codon gain/loss by Variant Effect Predictor was declared to have a possible effect on protein function (‘functional’). The affected protein domains were determined by consulting the UniProt database .
RNA-Seq data for NA12878 individual was obtained from the Gene Expression Omnibus (GEO) using accession number GSE30401 . Only FASTQ files corresponding to paired-end data were used. Since all sequencing was done on the same Illumina flow cell, the individual lane sequencing results were pooled. Phased genomic variants in hg19 coordinates were downloaded from . The data were aligned and filtered analogously to our own panel of samples (see Supplementary discussion in Additional file 1 for detailed description).
A set of known imprinted genes was downloaded from geneimprint.com  on 15 October 2012.
For example, if the alignment protocol is more likely to align reference-supporting read,
and provide the details of the full model in Supplementary methods in Additional file 1.
We employ the techniques of meta-analysis to combine information across SNVs within a single gene and derive a gene-level measurement of ASE. This approach is more robust than simply summing up the reads across individual haplotypes, since it takes into account the varying depth of coverage across SNVs and allows for adjustment due to potential local biases. The meta-analysis framework in its unmodified state presupposes the correct haplotype reconstruction for the validity of statistical inference. When the true haplotypes are unknown (as is the case in our data), we employ a ‘voting’ phasing algorithm. This pseudo-phasing produces the major and minor haplotypes by assigning SNV-level alleles with higher RNA read counts to the same ‘major’ haplotype and alleles with lower RNA read counts to the ‘minor’ haplotype.
A gene-level ‘average’ FT value, z hap1,g , is obtained as the inverse-variance-weighted average of SNV-level FT values z hap1,SNV . Under the null hypothesis of no ASE, we have p hap1,g = 0.5, and a nominal meta-analysis-derived P-value is assigned to observed z hap1,g , based on the known mean and variance of corresponding random variable Z hap1,g . This nominal P-value is not used, however, for reasons described below. Backtransformation  is used to obtain a gene-level estimate of the haplotype frequency (MAF), which we call T FT , from z hap1,g (Figure 1A). Further, we calculate inter-SNV ASE variability statistic Q, which measures the extent of heterogeneity between ASE measures z hap1,SNV at individual SNVs within gene g, and the corresponding P-value is reported by MBASED. Small heterogeneity P-values point to potential instances of isoform-specific ASE (note that our default model assumes no ASE differences among individual transcript isoforms, hence a single gene-level parameter p hap1,g ).
In the case where there is only one heterozygous SNV identified for a gene, we use an approximation to the two-sided binomial exact test. Briefly, we transform the major allele SNV count using Freeman-Tukey transformation into an FT value and then backtransform to obtain an estimate of MAF. This leads to a natural MAF estimate x maj,SNV /n SNV . Similar to the multi-SNV scenario, we treat this MAF estimate as a statistic and estimate its null distribution through simulations. With a large number of simulations, the resulting ASE P-value p g,ASE can become arbitrarily close to the P-value from the two-sided binomial exact test. Our motivation for adopting this alternative test was to ensure the consistency within our method with respect to single- and multi-SNV genes.
Using the same motivation, we also adopt a simulation-based approach to calculate P-values even if the phasing is known. In such instances, no additional phasing is done and major allele frequency is defined simply as the larger of the two observed allele frequencies.
The meta-analysis framework extends naturally to the two-sample comparison aiming to detect sample-specific ASE. We describe the procedure here in terms of comparing a ‘tumor’ sample to a ‘normal’ sample, but the comparison can be done for any two samples.
We now follow the procedure previously described for a one-sample analysis, and obtain gene-level PD value z hap1,g , which we use as an estimate of D and as a test statistic T PD (analogous to T FT in the one-sample case). We also calculate inter-SNV ASE variability statistic Q and use the resulting P-value to assess evidence for isoform-specific between-sample allelic imbalance.
MBASED modifies this procedure in order to deal with unphased data as follows. Only heterozygous SNVs detected in both samples are used to assess sample-specific ASE. For each gene, the SNVs are phased into ‘major’ and ‘minor’ haplotypes by the voting algorithm based exclusively on the data from the tumor sample. The phasing is the same for both samples to ensure that we are comparing the same haplotypes and is based on the tumor because it should be more informative in the cases of true tumor-specific ASE. We then estimate ASE using the meta-analysis procedure and treating ‘major’ haplotype as the tested ‘haplotype 1’. As before, simulations are employed to account for the bias introduced by the violations of binomial distribution assumptions due to phasing. Using these simulations, null distributions of T PD and Q are estimated, and the final ASE P-value p g,ASE and heterogeneity P-value p g,heterogeneity are obtained for each gene g. As part of this process, the underlying p hap1,g under the null hypothesis of no sample-specific ASE is estimated (Supplementary methods in Additional file 1) and used as (beta-)binomial probability of success in simulations.
Note that using MBASED to detect tumor-specific ASE may also uncover some instances of normal-specific ASE. This may happen if the haplotype phasing based on the tumor happens to recover the true underlying haplotypes and the tumor-derived ‘minor’ haplotype is strongly overexpressed in the normal sample. We stress, however, that the proper way to identify such genes is by treating the normal sample as the sample of interest.
The details of both one-sample and two-sample analyses as well as a detailed description of the modifications to this approach in presence of known (or estimated) overdispersion and pre-existing allelic bias are provided in Supplementary methods in Additional file 1.
We implemented the one- and two-sample analyses described above in the R package MBASED. In one-sample analysis, MBASED takes the reference and alternative allele RNA-Seq counts supplied by the user, as well as corresponding aseIDs (gene or transcript names) and haplotype assignments (if available) and reports for each aseID the estimated major haplotype frequency (MAF), the ASE P-value, and the inter-loci ASE variability P-value for multi-SNV genes. In two-sample analysis, MBASED takes the reference and alternative allele counts for both samples, as well as the corresponding aseIDs and reports the estimated MAF difference, the ASE P-value, and the inter-loci ASE variability P-value for multi-SNV genes. MBASED does not use any information about DNA (either WGS or WES) read counts, as ASE calls are based entirely on RNA data. The DNA-level data should be used during pre-processing steps to identify heterozygous exonic SNVs, according to the user’s criteria. If desired, a two-sample comparison can be performed by treating the RNA-Seq data as ‘tumor’ and DNA data as ‘normal’ to identify instances of allelic imbalance in the transcriptome not accounted for by the genomic allelic imbalance. However, this latter approach requires that both DNA and RNA-Seq data sets be produced under very similar conditions to avoid any systematic biases. Optionally, MBASED will estimate pre-existing allelic bias at each SNV from the data supplied by the user (or use directly supplied values) to properly adjust the reported effect size estimates and P-values. Similarly, user-supplied data can be used to estimate overdispersion at each SNV (or the values of the overdisperson parameter of beta-binomial distribution can be directly supplied). In our analysis we assumed that pre-existing allelic bias was the same at each SNV within a sample, and used a single estimate of the overdispersion parameter for all SNVs across all samples. Note that MBASED does not do any P-value adjustment for multiple hypothesis testing, and this task is left to the user. In the work presented here, we employed BH adjustment, but any other standard approach may be used. Further details are available in Supplementary methods in Additional file 1 and the package vignette.
Within each sample (or a pair of samples in the case of tumor-normal comparisons) only the genes containing at least one heterozygous exonic SNV passing our filtering criteria were tested for ASE. To declare a gene to be exhibiting ASE, we set the cutoffs on both statistical significance and the estimated extent of ASE (effect size). In one-sample analyses, we employed the BH procedure to adjust ASE P-values provided by MBASED within each sample for multiple hypothesis testing and require candidate ASE-exhibiting genes to have adjusted P-values ≤0.05 (FDR control of 5%). In practice, we encountered many instances of genes that showed statistically significant ASE but exhibited an allelic ratio insufficiently distinct from 1:1 to warrant biological significance. This was often the case for genes with high RNA-Seq coverage, where we were able to detect even small departures from equal allele expression. Therefore, we required that a candidate gene show estimated MAF ≥0.7 in order to be declared as exhibiting ASE.
In two-sample analyses, we required candidate ASE genes to have BH-adjusted ASE P-values ≤0.05 and to exhibit estimated MAF difference (Tumor - Normal) ≥ 0.2 (this is similar to requiring MAF to be ≥0.7 (=0.5 + 0.2) in the one-sample scenario). Additionally, we required genes exhibiting ASE in the two-sample analysis to also show evidence of ASE in the one-sample analysis.
In both one-sample and two-sample analyses, any genes showing significant evidence of inter-loci variability (BH-adjusted inter-loci ASE variability P-value ≤0.05) were flagged as possibly exhibiting isoform-specific ASE.
It should be stressed that ASE is not a binary characteristic but inherently presents as a spectrum within the transcriptome. Our assignment provided one way of determining a set of genes in which to further characterize ASE in downstream analyses, and we do not attach any strong biological meaning to our cutoffs.
We chose tumor/normal samples from HCC individual 2 as the basis for our simulations, since they showed comparable WGS and RNA-Seq coverages. For each sample we retained all SNVs that were tested for ASE in our main analysis.
We varied the value of MAF (major haplotype frequency), which measures the strength of ASE, from 0.7 to 0.9 across simulations, but kept it constant across SNVs within an individual simulation. We then ran MBASED on each simulated data set and declared any gene with an adjusted P-value ≤0.05 to exhibit ASE. We performed 100 simulations for each combination of f and MAF, and calculated average TPR and FDR within each stratum, as well as across all strata.
See Supplementary methods in Additional file 1 for discussion of the functional form of μ A in the presence of pre-existing allelic bias. We find the results to be very close to those observed in the no-bias simulations (Figures S5 to S6 in Additional file 1).
Finally, we tested the performance of MBASED at varying levels of overdispersion. We find that performance declines for high values of overdispersion (Figures S1 to S4 in Additional file 1), so care needs to be taken when applying MBASED to datasets with high noise levels.
The method of Skelly et al.  was run according to their tutorial. We followed the lead of the authors in using estimates a = 3,600 and d = 550 for the prior parameters. We used the posterior median of p as estimate of allele frequency and max(p, 1-p) as estimate of major allele frequency. We calculated the posterior P(ASE) as described in the tutorial. We declared any gene with posterior P(ASE) >0.95 and estimated MAF ≥0.7 as showing ASE. Detailed discussion of the comparison is provided in the Supplementary discussion in Additional file 1.
All 18 lung cancer cell lines and 3 paired tumor-normal NSCLC tissue samples have been previously published , and the sequencing data (WGS and RNA-Seq) are available from the NCBI database of Genotypes and Phenotypes (dbGaP)  under accession number phs000299. The SNP array data for lung cancer cell lines is available from NCBI GEO  under accession number GSE40908. The four paired tumor/normal HCC samples have been previously published  and the sequencing data (WGS and RNA-Seq) are available from dbGaP under accession number phs000384.
false discovery rate
Gene Expression Omnibus
major allele (haplotype) frequency
non-small cell lung cancer
single nucleotide polymorphism
single nucleotide variant
true positive rate
whole exome sequencing
whole genome sequencing
We thank Robert Gentleman for helpful discussion and support during project conception, Gerard Manning, Richard Bourgon and Thomas Sandmann for detailed manuscript review and critiques. We also thank Jens Reeder, Michael Lawrence, Matthew Brauer, Kiran Mukhyala, Christiaan Klijn and Florian Gnad for bioinformatics pipeline development and helpful discussions.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.