MBASED: allele-specific expression detection in cancer tissues and cell lines

Allele-specific gene expression, ASE, is an important aspect of gene regulation. We developed a novel method MBASED, meta-analysis based allele-specific expression detection for ASE detection using RNA-seq data that aggregates information across multiple single nucleotide variation loci to obtain a gene-level measure of ASE, even when prior phasing information is unavailable. MBASED is capable of one-sample and two-sample analyses and performs well in simulations. We applied MBASED to a panel of cancer cell lines and paired tumor-normal tissue samples, and observed extensive ASE in cancer, but not normal, samples, mainly driven by genomic copy number alterations. Electronic supplementary material The online version of this article (doi:10.1186/s13059-014-0405-3) contains supplementary material, which is available to authorized users.


SNV Filtering
Numerous studies of ASE that utilized RNA-Seq data have reported spurious signals due to various technical biases[1, 2,3]. For example, early read-mapping algorithms tended to favor the alignment of reads containing reference allele, since such alignments incurred no mismatch penalties. While this particular bias can be overcome by SNP-tolerant alignment tools [3,4], other sources of bias are more difficult to account for. With regards to alignment procedures, some authors simulated all possible SNP-covering reads and aligned them to the reference in order to identify SNPs that can give rise to artificial ASE even under ideal conditions [1,5]. However, it is far from clear how this procedure can be extended to the paired-end protocols. Besides, as will be seen, this approach cannot filter out all sources of artificial ASE. Spurious ASE has also been reported for SNVs in close proximity to indels [2,6] and in lowcomplexity or repetitive regions [2,7]. In some instances investigators reported detected ASE at a location found to be represented by 3 distinct haplotypes [2], an obvious artifact.
We adopted a filtering pipeline designed to discard SNVs that are likely to contribute to spurious ASE. First, we eliminated potential homozygous SNVs by requiring that at least 5 reads and at least 10% of all reads support each allele in WGS data. Second, we discarded all SNVs within 10 bp of another called variant to eliminate possible false variant calls due to alignment. Third, we discarded all SNVs in highly repetitive genomic regions, which we defined as regions with sequence identity of >95% to another genomic region based on selfChain Link track from UCSC Genome Browser [8,9,10,11].
The above steps were designed to filter out SNVs that might give rise to artificial ASE due to inadequacies of alignment or variant-calling procedures. However, we discovered that some of the most prominent cases of spurious ASE could be attributed to incomplete reference genome rather than the problems with adopted data processing tools. Figure S13A shows 20 genes most frequently exhibiting ASE prior to filtering. Closer scrutiny revealed most of them to be artifactual. For example, one of the top most recurrently allele-specifically expressed gene MAP2K3 exhibited high density of heterozygous SNVs in WGS data in all samples, but only one of those SNVs had the alternative allele detected in RNA data ( Figure S14). A closer look revealed that there seemed to be 2 haplotypes of MAP2K3 supported by WGS: one corresponding to reference genome and another containing virtually all alternative alleles at the detected SNVs. We further observed that 36 out of 37 MAP2K3 SNVs in our data were found in dbSNP [12] (v.132). One explanation for this observation is all of our samples were simultaneously heterozygous for these two haplotypes. A more likely explanation is that all of our samples contained a nonexpressed region in their genome with homology to MAP2K3 that was absent from the reference genome, indicating a possible assembly error in the reference genome. It is likely then that the observed sequence differences between MAP2K3 and the putative homologous region were previously erroneously identified as SNPs within MAP2K3 sequence, explaining their presence in dbSNP. Another example of similar nature was provided by gene SEC22B ( Figure S15). Here we observed evidence for 3 distinct haplotypes at DNA level in all of our samples, only one of which is detected at RNA level.
Such artifacts will not be filtered out by the alignment of simulated reads, since the absence of the existing homologous regions from the reference genome will force false SNP identification. In fact, reference[1] reported SEC22B as one of their ASE genes after filtering out artifactual SNPs. It should be noted that these artifacts will also not be filtered out by using DNA counts for normalization, since haplotypes presented at RNA and DNA levels are distinct. We therefore attempted to remove such regions from the analysis based on two proxy measures. First, we quantified the extent of detected variants in a gene by calculating the number of variants per kb of exonic length (VPKE) for each gene in each sample. Within each sample, we identified genes with highest 2% VPKE values and removed them entirely from further analysis. Our reasoning was that genes exhibiting very high number of variants (normalized for exonic length) might correspond to unannotated homologous regions described above. Secondly, we removed any SNVs that fall into genomic segments that were found to contain structural variations in 5 or more different samples in Database of Genomic Variation [13,14,11]. The effect of our filtering schema on the 20 genes most frequently exhibiting ASE is shown in Figure S13B. Of the top 20 genes, only the known monoallelically expressed gene ERAP2 does not lose any of its SNVs to our filtering procedure. We believe that similar filtering pipelines will remain necessary in ASE studies to avoid spurious calls until all of the biases have been successfully addressed.

MBASED Details
Here we describe the statistical framework used by the MBASED (Meta-analysis-Based Allele-Specific Expression Detection) algorithm for identifying instances of allele-specific gene expression in RNA count data. In 1-sample analyses, MBASED identifies within-sample deviations from the null hypothesis of equal allele expression, while for 2-sample analyses, differences in allelic ratios between samples with the same haplotypes are detected. Two-sample comparisons are therefore meaningful for samples from the same individual (paired analysis), but not necessarily for samples from different individuals. MBASED uses principles of meta-analysis to combine information across loci within a single unit of expression (which we call an aseID, e.g., a gene, a transcript, or an individual SNV). The informative loci are usually heterozygous SNVs, but could also be aggregations of SNVs that have been merged into a single locus. In the latter case, the true underlying haplotypes of the locus with respect to constituent SNVs must be known or calculated (e.g., by a formal phasing algorithm or examining reads overlapping multiple SNV locations). In the following discussion, we will use 'gene' in place of aseID and 'SNV' in place of 'locus', as these are the units we typically use in practice.

Model assumptions and MBASED outline
We assume that for each tested gene in a sample (i.e., a gene with at least one heterozygous SNV), there are exactly two distinct haplotypes. For each gene g, one can model the detected haplotype 1 allele read counts X hap1,SN V at each SNV as independent F (n SN V , p hap1,g , θ SN V,g ) random variables, where • F is some choice of discrete probability distribution, e.g., binomial • n SN V is the total number of reads covering a particular SNV • p hap1,g is the true underlying frequency of haplotype 1 transcripts for that gene (with p hap2,g = 1 − p hap1,g ) • θ SN V,g denotes any additional parameters for specific model choices Under the null hypothesis of no ASE, the two haplotypes are expressed at the same rate, and therefore p hap1,g = p hap2,g = 0.5. Note that we explicitly assume that there is no difference in ASE between individual transcript isoforms of gene g (hence a single haplotype frequency parameter p hap1,g ). We address the issue of isoform-specific ASE detection in section 2.1.1. In practice, the assumption of independence can be grossly violated if two heterozygous SNVs are close enough to each other to be largely covered by the same set of reads. In such cases the two SNVs should ideally be merged into a single locus (we will continue to refer to such merged loci as 'SNV' for clarity of presentation).

Meta-analysis
In most studies to date, ASE testing was done at the level of individual SNVs, even if more than 1 heterozygous SNV per gene was present [15,1,16,17,18,19,7]. Most of the time, only the agreement of final SNV-level calls is observed, and no attempt is made to share information across SNVs or to assign weights to SNV-specific calls based on the calls' confidence (e.g., amount of read coverage). In some instances, when haplotypes are phased, the SNV-level counts are added up across haplotypes [2,5]. This, however, masks potential ASE heterogeneity due to alternative splicing.
We describe how meta-analysis techniques, following [20] (DerSimonian-Laird), can be employed to test a multi-SNV gene for ASE based on observed SNV-specific allele read counts. These techniques were initially developed to aggregate information across multiple studies of the same phenomenon. Study-specific measurements are aggregated by taking their inverse variance-weighted average to provide an overall estimate and a corresponding measure of confidence (e.g., a confidence interval or a p-value for deviation from some null value of overall effect).
MBASED adopts this approach to the problem of combining ASE information across individual SNVs into a gene-level measure of ASE. In the 1-sample analysis, we choose a haplotype (haplotype 1) and test whether it is over-(p hap1,g > 0. 5) or under-(p hap1,g < 0.5) represented relative to the other haplotype (haplotype 2) based on allelic counts at individual SNVs. In the 2-sample analysis, we test whether the haplotype frequencies are different between the two samples (p hap1,g,sample1 = p hap1,g,sample2 ). We report the estimate of ASE exhibited by the gene and the corresponding pvalue.
Furthermore, in the DerSimonian-Laird meta-analytic framework, a p-value based on a constancy test statistic Q measures the extent of heterogeneity of observed ASE at SNVs within a gene. These heterogeneity p-values are also reported by MBASED, with small p-values indicating different levels of underlying ASE at individual SNVs, which may be attributed to isoform-specific ASE (if the unit of ASE is a gene) or some other source of excessive variability in ASE measurements.
The meta-analysis framework in its unmodified state presupposes the correct haplotype reconstruction for the validity of statistical inference. However, in practice, we do not have phasing information for any of our samples, a situation typical of currently available RNA-Seq studies. We therefore designed MBASED to handle both datasets with and without phasing information.
When true haplotypes are unknown, one can, in principle, attempt to phase SNVs into haplotypes based on population databases and prior linkage disequilibrium information [21]. However, this approach is restricted to known SNPs and does not allow for easy phasing of mutations, particularly somatic mutations arising in tumor samples.
Instead, in abasence of haplotype information, we phase individual SNVs into gene haplotypes based on observed read counts. In such cases, we perform Monte Carlo simulations based on the null hypothesis of no ASE (p hap1,g = 0.5 in 1-sample analysis, p hap1,g,sample1 = p hap1,g,sample2 in 2-sample analysis) to assign statistical significance to the resulting measures of gene-level ASE. The final reported p-values are then based on the results of these simulations.

Generative statistical models for read counts
The binomial model is widely used for ASE detection in count data [1,17,2,19,22,7]. In the simplest version of the model, the detected haplotype 1 allele read counts are modeled as (1) random variables. Sometimes, pre-existing allelic bias is present, which results in over-representation of reads supporting a particular allele even under conditions of no ASE. In such cases, it is common to modify model (1) by incorporating the underlying bias into the generative binomial probability: We discuss the issue of pre-existing allelic bias and its incorporation into the model in detail in section 2.1.3. A number of authors have advocated the use of beta-binomial model to accomodate extrabinomial variability (overdispersion) observed in real RNA-Seq data. This additional variability can arise due to various technical biases [6], aggregation of data across individuals [5], or heterogeneity of ASE across SNVs in the same gene due to alternative splicing [23]. In the first two cases, the additional variability is a nuisance parameter to be accounted for, while in the last case it might provide evidence of isoform-specific ASE, a potentially interesting biological finding. MBASED follows the lead of these authors in adopting a beta-binomial model, and attempts to separate the two sources of overdispersion by estimating the inter-SNV ASE variability ('isoform effect') after accounting for overdispersion due to technical effects. This approach is different from either ignoring the distinction and treating the biological and technical overdispersion jointly [5,6], or assuming different dispersion parameters for genes with and without ASE [23]. In the latter case, the dispersion estimate in a gene with variable ASE is an aggregated measure of both technical and isoform effects, and does not distinguish between the two.
We say that X ∼ BetaBin(n, µ, ρ), if X ∼ Bin(n, Y ) and Y ∼ Beta(µ, ρ), where µ = E(Y ) ∈ (0, 1) and ρ = var(Y ) µ×(1−µ) ∈ (0, 1). In this parametrization, ρ is the overdispersion parameter, with the convention that ρ = 0 ⇒ P (Y = µ) = 1 (leading to X ∼ Bin(n, µ), standard binomial model). We then have In the ASE setting, one can model the detected haplotype 1 allele counts X hap1,SN V at each SNV within gene g as independent random variables with If ρ SN V = 0, the model reduces to the binomial model (1). When pre-existing allelic bias is present, model (3) can be modified as: This is the full statistical model supported by MBASED. By default, MBASED assumes that there is no pre-existing allelic bias and no overdispersion in the data, reducing model (4) to model (1). The user can supply the information on pre-existing allelic bias and overdispersion or allow MBASED to estimate it from the data, in order to take advantage of the full model (4). Some authors estimate the parameter ρ SN V within an individual gene (ρ SN V = ρ g ), using Bayesian prior distribution assumptions, and allowing ρ g to capture overdispersion due to both technical effects and isoform-spedific ASE [23]. Other authors let genes share the overdispersion parameter (ρ SN V = ρ), modeling ρ as capturing overdispersion due to technical effects, the extent of which is assumed to be constant across all loci within a sample [5,6].
In the model adopted by MBASED, ρ SN V represents nuisance overdispersion, due to technical effects only. In other words, the default MBASED model (4) assumes that there is no additional overdispersion due to biological events of interest, e.g., isoform-specific ASE. The evidence for the existence of this additional overdispersion is evaluated separately, as described below, allowing MBASED to separate the two sources of extra-binomial variability. In practice, MBASED requires the values ρ SN V , representing technical overdispersion, to be specified for each locus in the sample, and therefore cases of SNV-specific (no relationship between ρ SN V ), gene-specific (ρ SN V = ρ g ), and sample-specific (ρ SN V = ρ) modeling restrictions on ρ SN V are handled within the same framework. While MBASED provides functionality to estimate ρ (or ρ g , or ρ SN V ), it requires the user to supply the data to be used in such estimation. In other words, MBASED relies entirely on the user for the estimates of ρ SN V , and it is important that these estimates capture the entire extent of technical overdispersion. We assume in what follows that the appropriate values of ρ SN V have been supplied to MBASED, and we will describe in section 2.2.7 the estimation approach that we adopted in this study.

Pre-existing allelic bias
We discuss the pre-existing allelic bias in the context of binomial generative model, and the interpretation in the case of the beta-binomial model is entirely analogous. We introduce the notation f hap1,SN V (p hap1,g ) in specifying model (2), with 0 ≤ f hap1,SN V (p hap1,g ) ≤ 1, to highlight that the probability of observing a read supporting haplotype 1 at a given locus may not be the same as the underlying haplotype frequency p hap1,g . This may occur when the RNA enrichment protocol or the choice of alignment procedure results in allelic bias. For example, an alignment procedure may favor reads containing the reference allele[1]. In such instances the SNVs where the haplotype 1 allele is reference will exhibit f hap1,SN V (p hap1,g ) > p hap1,g , and, conversely, the SNVs where the haplotype 1 allele is alternative will exhibit f hap1,SN V (p hap1,g ) < p hap1,g . Note that the probability of observing a read supporting haplotype 2 is In the binomial model shown in (2), the probabilities of observing the two haplotype alleles under the conditions of no ASE are given by f hap1,SN V (0.5) and f hap2,SN V (0.5). Whenever f hap1,SN V (p hap1,g ) = p hap1,g , we say that pre-existing allelic bias is present in the data. When assessing deviations from null hypothesis of no ASE (p hap1,g = 0.5), most authors assume no pre-existing allelic bias, while others estimate f hap1,SN V (0.5), either through simulations[1], or from genomic DNA (where p hap1,g = p hap2,g = 0.5, under heterozygous diploid conditions) [22,23].
MBASED and the analysis presented in this study assume a particular functional form of f hap1,SN V . Namely, we assume that where • p hap1,g and p hap2,g are the true underlying frequencies of haplotype 1 and haplotype 2 transcripts, with p hap1,g + p hap2,g = 1 • f hap1,SN V (0.5) and f hap2,SN V (0.5) are underlying frequencies of observed haplotype 1 and halpotype 2 allele-supporting reads under the conditions of no ASE, with f hap1,SN V (0.5) + f hap2,SN V (0.5) = 1 from equation (5) • f hap1,SN V (p hap1,g ) and f hap2,SN V (p hap2,g ) are the true underlying frequencies of observed haplotype 1 and haplotype 2 allele-supporting reads We note that • f hap1,SN V (p hap1,g ) is proportional to both p hap1,g and f hap1,SN V (0.5) and is uniquely determined by these two frequencies, an approch similar to the one employed by [22] •

as required
• If f hap1,SN V (0.5) = 0.5 (no allelic bias under conditions of no ASE), then f hap1,SN V (p hap1,g ) = p hap1,g (no allelic bias under any ASE setting) To illustrate the logic behind our approach, consider a situation where an alignment procedure favors the reference allele over the alternative allele, with the result that, in the absence of ASE, we observe reference allele 60% of the time, on average. Suppose now that we have a situation of genuine ASE (say, p hap1,g = 0. 8), and that at the SNV of interest, the allele supporting haplotype 1 is reference. Then letting p ref = 0.6 and p alt = 1 − p ref = 0.4 denote the probabilities of observing the reference and alternative alleles, respectively, under conditions of no ASE, we have Similarly, if the allele supporting haplotype 1 at our SNV is alternative, we have Notice that we are more likely to observe the major allele if it is reference (p = 0.86) than if it is alternative (p = 0.73), even though the underlying true major haplotype frequency in the transcriptome is the same (0.8).
In practice, MBASED requires the values f hap1,SN V (0.5), representing pre-existing allelic bias, to be specified for each locus in the sample. While MBASED provides functionality to estimate f hap1,SN V (0.5), it requires the user to supply the data to be used in such estimation. We assume in what follows that the appropriate values of f hap1,SN V (0.5) have been supplied to MBASED, and describe in section 2.2.5 the estimation approach that we adopted in this study.

Use of DNA read counts in ASE analyses
A number of studies have used allele-specific genomic DNA (gDNA) counts to adjust RNA-based measures of ASE for any technical biases that may give rise to artifactual allelic imbalance. This practice is commonly employed in microarray studies, where both gDNA and RNA-derived cDNA molecules are hybridized to the same array type [18], but it can also be done in the sequencing setting [17,16,19,22,23]. In both cases, one has to assume that any biases introduced into the measures of ASE by the experimental procedure are shared by the DNA and RNA samples. In our case, DNA and RNA read counts were derived from different sequencing platforms (CGI and Illumina, respectively) and we do not believe that a direct comparison of RNA and DNA allelic ratios would be appropriate. However, one can easily compare RNA and gDNA samples directly using the 2-sample version of MBASED to specifically detect genes showing ASE in excess of what can be explained by genomic allelic imbalance alone. In this study we are interested in all instances of ASE in cancer samples, including the ones arising from copy number alterations, and therefore do not perform this comparison. In addition, some authors have used gDNA counts to estimate parameters of the assumed distribution of RNA counts. One example of this is using gDNA counts to estimate overdispersion parameter of the beta-binomial distribution [23]. This is most appropriate when RNA and DNA samples are sequenced and aligned using as similar protocols as possible. For this reason, while we do adopt a beta-binomial model, we do not employ gDNA to estimate overdispersion in the current study. However, MBASED allows the user to supply gDNA counts for this purpose, as described in the package vignette.

1-Sample Analysis
We assume that for gene g in the sample of interest, the number of reads supporting haplotype 1 at an exonic heterozygous SNV X hap1,SN V follows a beta-binomial model (4), and we denote the observed value of this random variable as x hap1,SN V . In 1-sample analysis, our measure of ASE is the major haplotype frequency (MAF) defined as p maj,g = max(p hap1,g , p hap2,g ), and we want to test the null hypothesis of no ASE H 0 : p maj,g = 0.5 For clarity of presentation, we first describe the MBASED algorithm under the assumption of binomial read counts (ρ SN V = 0) and of no pre-existing allelic bias (f SN V (p hap1,g ) = p hap1,g ), reducing the model to We then demonstrate the extensions to take pre-existing allelic bias into account (f SN V (p hap1,g ) = p hap1,g , section 2.2.4) and to include overdispersion (ρ SN V = 0, section 2.2.6).

Freeman-Tukey transformation
In our application of DerSimonian-Laird meta-analytic framework to the 1-sample ASE analysis, we largely follow the approach implemented in the function metaprop() in R package meta [24]. This implementation applies meta-analysis to binomial (but not beta-binomial) counts through the use of the variance-stabilizing Freeman-Tukey (FT) transformation [25], briefly described here. Let X ∼ Bin(n, p), and let x denote an observed value of X. Then is an approximately N 2 sin −1 ( √ p), 1 n+0.5 random variable, with an observed value z = F T (x, n) ∈ (0, π). Note that the (approximate) variance of Z does not depend on p, but the (approximate) mean does, illustrating the variance-stabilizing properties of the transformation. The companion backtransformation P = F T −1 (z, n) [26] takes z and the original total count n and produces the proportion 0 ≤ P ≤ 1. The backtransformation is exact in the sense that it gives back the observed proportion: F T −1 (F T (x, n), n) = x/n.

Multi-SNV genes
Consider a gene with several heterozygous exonic SNVs, and suppose that phasing information is available, i.e., for each SNV we know which allele belongs to haplotype 1 and which to haplotype 2. We choose a haplotype (say, haplotype 1) and convert the observed haplotype 1 read counts using the Freeman-Tukey transformation into corresponding FT-values, Under the assumptions of binomial model (8), the tranformed values z hap1,SN V are realizations of independent (approximately) normal random variables Z hap1,SN V , with In particular, under conditions of no ASE (p hap1,g = 0.5), we have E(Z hap1,SN V ) = π/2. Further, While the expressions for mean and variance of Z hap1,SN V given above are approximate, the approximation is known to be quite good as long as E(X hap1,SN V ) is not too close to 0 or n SN V (i.e., we expect at least few counts to support each allele).
We then compute a gene-level FT-value, z hap1,g , as where Thus, z hap1,g is the inverse variance-weighted mean of the z hap1,SN V 's, where the weights are normalized to add up to 1. Note that z hap1,g is the observed value of the random variable Since for each SNV E(Z hap1,SN V ) = 2 sin −1 ( √ p hap1,g ) and var(Z hap1, and var(Z hap1,g ) = 1 Note that the contribution of each individual observation (SNV) is weighted according to coverage (total number of reads), and this weight is independent of the ASE estimate at that SNV (proportion of reads supporting haplotype 1 allele).
We also calculate DerSimonian-Laird constancy statistic where w SN V is given in (14). Q (also known as Cochran's heterogeneity Q) measures heterogeneity among SNV-level ASE estimates, and under the hypothesis of homogeneity (p hap1,SN V = p hap1,g is constant across SNVs within gene g), the distribution of Q is approximately χ 2 with (#SNVs-1) degrees of freedom. The resulting heterogeneity p-value is recorded, with small p-values indicating evidence for isoform-specific ASE or some other source of excessive variability. We note that tests for heterogeneity based on Q have lower power in the common setting of few SNVs and low coverage, and the reader is warned that MBASED might be under-powered to detect instances of true heterogeneity in those situations.
To obtain the gene-level estimate of p hap1,g , we use the Freeman-Tukey backtransformation p hap1,g = F T −1 (z hap1,g , n g ) (19) where n g stands for some gene-level average of n SN V 's. In some rare cases (and depending on the choice of n g ) it may happen that z hap1,g < F T (0, n g ) or z hap1,g > F T (n g , n g ). In other words, the transformed value is outside of the range attainable for a given n g . One way this may happen is for one SNV to show very low coverage n SN V1 and very low frequency of haplotype 1-supporting reads (z hap1,SN V1 close to 0), while a second SNV shows very high coverage n SN V2 >> n SN V,1 and very high frequency of haplotype 1-supporting reads (z hap1,SN V2 close to π). In this scenario, it may be possible for the gene-level value z hap1,g to be so close to π as to be outside of the attainable range for n g , which may not be close enough to n SN V2 to ensure this does not happen. To ensure that the back-transformation is meaningful, we handle these two cases by assigning to z hap1,g the values of F T (0, n g ) and F T (n g , n g ), respectively. While the implementation of meta analysis in [24] follows the recommendation of [26] in using the harmonic mean of n SN V 's as the 'average' total count n g for backtransformation, we find that this gives too much weight to the loci with few read counts. Instead, we let n g be a weighted arithmetic mean of n SN V 's, with weights proportional to n SN V 's and adding up to 1. In practice we find that the differences are negligible except for a few extreme cases of SNVs within the same gene that have drastically different coverage. The gene-level estimate of MAF is obtained as Since under the null hypothesis of no ASE, Z hap1,g follows a normal distribution with known mean and variance, a z-test can be employed to assess the significance of deviation from the null. A p-value can then be assigned to the observed extent of ASE. This is the p-value reported by MBASED when phasing information is available for the analyzed sample.
In practice, we do not have haplotype phasing information for samples in our study and MBASED was designed to be able to deal with such data sets. It circumvents the problem by phasing alleles at individual SNVs within a gene into haplotypes by assigning all major allele counts to one haplotype (which we will call haplotype 1) and all minor allele counts to another haplotype (haplotype 2), a so-called 'voting' approach to phasing. In other words, at each SNV we assign reference allele to haplotype 1 if and to haplotype 2 otherwise. Note that by design ≥ 0.5. The estimated haplotype 1 may not correspond to either of the two true underlying gene haplotypes. However, we expect that in cases of actual ASE (p maj,g >> 0.5), our approach will in fact recover the true haplotypes, with haplotype 1 corresponding to the major of the two. We found this to be the case in our analysis of a sample with known haplotypes (secton 3). We also observed that in the simulations described in the main text, we successfully recovered true haplotypes for > 99.5% of ASE gene with ≥ 2 SNVs, when the MAF was set to 0.8, and > 96% of such genes when the MAF was set to 0.7 (data not shown). Even for a highly unfavorable case of small effect size (M AF = 0.7), large number of SNVs (4), and low coverage (10 − 20 reads/SNV), we recover more than 90% of true haplotypes.
We then proceed to apply the meta-analysis technique described above to haplotype 1 read counts, and derive the gene-level estimate T F T and the constancy statistic Q. However, the distributional assumptions on Z hap1,g under the null hypothesis of no ASE no longer hold, since, after phasing by MBASED, the observed counts x hap1,SN V are no longer the realizations of random variables X hap1,SN V ∼ Bin(n = n SN V , p = p hap1,g ) Instead, they are realizations of random variables It is easy to see that using nominal p-values produced by applying meta-analysis to our pseudophased haplotypes will result in anti-conservative behavior of the testing procedure, as each individual SNV shows overexpression of the same haplotype. We estimate the true sampling distribution of T F T by Monte Carlo simulations as follows. During each round of simulation, a 'reference' allele count x sim,ref,SN V is generated for each SNV in a gene from a Bin(n = n SN V , p = 0.5) distribution, and the 'alternative' allele count The resulting allele counts are then phased into haplotype 1 ('major') and haplotype 2 ('minor' ) using the voting approach, and the meta-analysis MAF estimate T sim,F T is produced. Repeating this process N sim times, an approximation to the null distribution of T F T under the hypothesis of no ASE is generated. Finally, the ASE p-value is calculated as We note that simulations can also, in principle, be performed in the case of known true gene haplotypes. With large number of simulations, the resulting p-values will approximate the nominal meta-analysis p-values to an arbitrary degree of precision. This is the approach we adopted when comparing the performance of MBASED with and without information about true haplotypes (section 3), in order to obtain p-values with the same level of granularity. We observed that the correlation between the simulated (10 6 simulation rounds) and nominal p-values for multi-SNV genes was > 0.999 (data not shown).
Similarly, we approximate the null distribution of heterogeneity statistic Q by recording the observed values of Q during each simulation, and calculate the heterogenity p-value to be

Single-SNV genes
For a single SNV under the conditions of no ASE, we have where X ref,SN V is the reference allele count. We use an approximation to a two-tailed binomial test to assess the extent of ASE in such genes. Formally, and analogously to the multi-SNV case, we take the observed major allele count x SN V and transform it to In practice, for single-SNV genes, the transformation and back-transformation can be skipped.
Simulations are employed as for the multi-SNV case, the null distribution of T F T is estimated, and the final p-value p ASE is calculated as in (24). With large number of simulations, p ASE will approximate the p-value from a two-tailed binomial exact test to an arbitrary degree of precision. The reason that we do not use the binomial test explicitly is our preference for a computationally uniform approach in both single-and multi-SNV scenarios.

Adjustment for pre-existing allelic bias
We assume once again that true phasing is known and extend model (8) to include situations of pre-existing allelic bias, by letting where we allow We further impose constraints (6) on f hap1,SN V (p hap1,g ). We assume that f hap1,SN V (0.5) (probability of observing haplotype 1-supporting allele under conditions of no ASE) is known (or previously estimated by MBASED, see package vignette for details).
As an example of complications introduced by pre-existing allelic bias, consider the situation where reference bias exists at some SNV, such that probability of observing the reference allele under the conditions of no ASE is Then, for a total coverage of 20 reads, observing 14 reference-supporting reads should be considered the perfect case of no ASE (14/20 = 0.7, the expected fraction of reference reads under the no-ASE model), and having 12 reference-supporting reads would render the alternative allele 'major' and the reference allele 'minor' (since we observe fewer reference allele counts than we expect: 12/20 < 0.7). Below we describe how the MBASED algorithm can incorporate information about pre-existing allelic bias into its computations.
At each SNV we employ the Freeman-Tukey transformation (10). Under the assumptions of model (27), the transformed values z hap1,SN V are realizations of independent (approximately) normal random varaibles Z hap1,SN V , with 5) , and the last equality is obtained from equation (6). In particular, under conditions of no ASE (p hap1,g = 0.5), we have Pre-existing allelic bias has no effect on the variance of Z hap1,SN V due to the variance-stabilizing property of Freeman-Tukey transformation, and we have the same as in equation (12). We need to modify our procedure for combining estimates of ASE at individual SNVs into a genelevel estimate to account for the fact that E(Z hap1,SN V ) may no longer be constant across SNVs within the same gene and is no longer equal to 2 sin −1 ( √ p hap1,g ). Consider a shifting transformation It follows that and we are once again within the framework described in section 2.2.2. Note that the shifting transformation (32) has no effect on variance: To see how successful this shifting transformation is in bringing the means of Z hap1,SN V 's close to 2 sin −1 ( √ p hap1,g ) when p hap1,g = 0.5, we assessed the performance of the shift across a range of values of p hap1,g and f hap1,SN V (0.5) (Figures S16 and S17). We find that the performance is remarkably good, especially for mild values of allelic bias 0.
The pre-existing allelic bias adjustment is incorporated into a 1-sample MBASED analysis in the following ways: 1. Haplotype phasing (if true haplotypes are unknown) : At each SNV we assign reference allele to haplotype 1 ('major') if and to haplotype 2 ('minor') otherwise. If f ref,SN V (0.5) = 0.5 (no pre-existing allelic bias), this is the same condition as before, since the expression (36) reduces to In all cases, the condition is essentially checking whether the observed reference allele count is larger or smaller than expected under conditions of no ASE.

Estimating
which is the ratio of all reference reads to total reads in the sample, after excluding the loci with the top 5% of read counts (trimmed for robustness). Note that our estimate of f ref,SN V (0.5) is samplespecfic. We find that across 32 samples, the no-ASE reference probability f ref,SN V (0.5) ranges from 0.50 to 0.55 with a median of 0.52. In other words, we observe mild levels of pre-existing allelic bias that favors reference allele. In cases where we merged multiple SNVs into a single locus because they were covered largely by the same set of reads, we used a slightly more complicated procedure to obtain estimate of pre-existing allelic bias. A quick example will suffice to illustrate our approach. Consider a locus with 2 SNVs. Since the SNVs are overlapped by the same reads, we can resolve them unambigously into 2 haplotypes. Suppose, that the two resolved haplotypes at this locus are R 1 = {ref, ref } and R 2 = {alt, alt}, and suppose that the allele R 1 belongs to haplotype 1 for gene g. Let p ref denote the constant f ref,SN V (0.5) we estimated for the sample in hand. Then we model the probability of observing a read supporting haplotype 1 at this locus under the null hypothesis of no ASE as Similarly, if the two locus haplotypes are R 1 = {ref, alt} and We then let p hap2 = 1 − p hap1 . The reasoning used here is entirely analogous to that used to define proportionality conditions (6), with overall (locus) allele probability under conditions of no ASE proportional to allele probabilities at each individual SNV. We then treat the locus as a single SNV, with allelic probabilities under conditions of no ASE given by f R1,SN V (0.5) and f R2,SN V (0.5), and use read counts supporting R 1 and R 2 in our calculations, analogously to using read counts supporting reference and alternative alleles in the case of a locus with a single SNV. We stress that this calculation is not part of the MBASED algorithm, but is a feature of our preprocessing of the data prior to running MBASED.

Adjustment for technical overdispersion
We now extend the binomial model (27) from the preceding section to the beta-binomial model (4) used by MBASED, where we allow We assume that true phasing is known and that the value of ρ SN V for each SNV is known (or previously estimated by MBASED, see package vignette for details).
Consider SNV-level FT-values Z hap1,SN V,adj , defined in (32). While switching from binomial to beta-binomial distributional assumptions on X hap1,SN V may prevent Z hap1,SN V,adj from following (approximately) the normal distribution, we note that the assumption of normality is never explicitly employed by MBASED. Instead, the empirical distribution of the test statistic T F T is estimated in order to assign statistical significance to the observed ASE. Therefore, our concern is only with the effects of overdispersion on the mean and variance of Z hap1,SN V,adj . Let Then is another parametrization of (4). Note that Both 2 sin −1 f hap1,SN V (0.5) and 2 sin −1 ( √ 0.5) are constants. Furthermore, by law of total expectation: Taking the ratio of the second term in the expansion to the first term, we observe that when 0 ≤ ρ ≤ 0.02, 0.2 ≤ f hap1,SN V (0.5) ≤ 0.8, and 0.1 ≤ p hap1,g ≤ 0.9. We believe these values to encompass the range of overdispersion and pre-existing allelic bias commonly observed in real data. For example, we estimate ρ ≤ 0.01 in all of our normal data sets (see discussion below), while other authors report overdispersion several orders of magnitude lower [23]. Therefore and from (43): Plugging in the value of µ = f hap1,SN V (p hap1,g ), we observe that this is the same expression as (33), and therefore E(Z hap1,SN V,adj ) is unaffected by the presence of overdispersion. Just as in section 2.2.4, we performed simulations across a range of values of ρ, f SN V,hap1 (0.5) and p hap1,g to assess how good the shifting transformation (32) is at bringing the means of Z hap1,SN V 's close to 2 sin −1 ( √ p hap1,g ). We find that even for very large values of ρ (ρ = 0.1), the approximation performs similarly to the case of no overdispersion (data not shown). Now, consider We have since, due to the variance-stabilizing property of Freeman-Tukey transformation, var(Z hap1,SN V,adj ) does not depend on Y . Further, using the first two terms from the Taylor series expansion employed in (45), we get Note that it is identical to the previous formula for var(Z hap1,SN V,adj ) when ρ = 0. Simulations across a range of values of ρ, f SN V,hap1 (0.5) and p hap1,g showed approximation (52) to be quite excellent at moderate values of overdispersion (ρ < 0.02, data not shown). MBASED adjusts for technical overdispersion in the following ways: 1. Meta-analysis: we use the new variance estimates (52) when calculating weights (14) distribution.

Estimating ρ SN V
By default, MBASED assumes that at each locus ρ SN V = 0, i.e., that there is no overdispersion due to technical effects. The user can supply the value of ρ SN V for each locus, or supply a list of read counts from which this value should be estimated (see package vignette for more details). In our analysis, we estimated ρ SN V within 7 normal samples only, since large amount of copy number alterations in tumor samples will lead to very high estimates of overdispersion in the data. We assumed that ρ SN V is constant across all SNVs within a sample (global overdispersion effect), and we estimated this value by maximum likelihood within each sample. At each SNV, we found the likelihood for the observed number of reference allele-supporting reads under the The value of ρ maximizing the joint likelihood over all SNVs was then used as an estimate of ρ SN V at all SNVs in that sample. The estimation procedure only included SNVs that were not parts of larger merged loci. To eliminate potential artifacts that may affect the model fit, we further required that the SNV be detected in at least 2 samples, with at least 1 sample presenting the reference allele as major and at least 1 sample presenting the alternative allele as major. Finally, after the initial model fit, we calculated the model-based p-value at each SNV, removed all SNVs with Benjamini-Hochberg adjusted p-values ≤ 0.05 (outliers), and re-fit the model to obtain the final estimate. We found that a small number of SNVs get removed by the outlier filtering (≤ 1% in each sample), and that no SNVs showed small adjusted p-values after the second fit. We found this step of outlier removal quite useful, as discaring one or two SNVs often led to two-fold decrease in overdispersion estimate. Moreover, since the decision to remove an SNV was based on adjusted p-value, we strongly feel that the final estimates represent the appropriate model parameters for all but a hanful of SNVs in the data. We found that for 5 out of 7 normal samples, overdispersion estimates ρ are between 0.001 and 0.005, with the other 2 exhibiting extreme values of 0.0000005 and 0.01. The median ρ across all 7 normal samples was 0.004 and this is the value we used for all samples (both tumor and normal) in the current study.

2-Sample Analysis
In some situations it may be desirable to compare allelic expression between two samples. One scenario for this situation is comparing tumor and normal samples from the same individual to detect tumor-specific ASE or loss of imprinting (normal-specific ASE). Another scenario is comparing RNA counts to those form genomic DNA to identify instances of ASE which are not associated with underlying copy number imbalance. In what follows, we will let sample 1 stand for 'tumor' and sample 2 stand for 'normal'. Note that the presented approach is not symmetrical with respect to the two samples, and care should be taken in deciding which sample to designate as either sample 1 or sample 2. We assume that there are exactly 2 distinct haplotypes for each tested gene, that the haplotypes are the same between the two samples, and that for gene g and sample s ∈ {sample1, sample2}, the observed number of reads supporting haplotype 1 at an exonic heterozygous SNV X hap1,SN V,s follows the beta-binomial model (4), which we reproduce here: Note that model parameters are sample-specific. We denote the observed counts in sample 1 and sample 2 by x hap1,SN V,sample1 and x hap1,SN V,sample2 , respectively. As with the 1-sample MBASED algorithm, we assume that SNV-level counts within a gene are independent, and we further assume independence of SNV-level counts between the two samples. We choose the haplotype frequency difference for some choice of haplotype 1, defined as as our measure of sample-specific ASE (or between-sample allelic imbalance). This is known as 'excess risk' in the epidemiology literature. We are interested in testing the null hypothesis of no sample-specific ASE: Note that we do not require that p hap1,g = 0.5 in the null hypothesis (59), since in the 2-sample analysis we are only interested in the genes with sample-specific ASE, rather than all genes exhibiting ASE, which may include instances of ASE common to both samples.
As with 1-sample analysis, we will first describe the algorithm under the assumption of binomial read counts (ρ SN V,s = 0 for both samples) with no pre-existing allelic bias (f SN V (p hap1,g,s ) = p hap1,g,s for both samples), reducing the model to We then demonstrate the extensions to take pre-existing allelic bias into account (f SN V (p hap1,g,s ) = p hap1,g,s , section 2.3.4) and to include overdispersion (ρ SN V,s = 0, section 2.3.5).

Multi-SNV genes
Consider a gene with several heterozygous exonic SNVs, and suppose that phasing information is available, i.e., for each SNV we know which allele belongs to haplotype 1 and which to haplotype 2. Meta-analysis techinques can be employed to test a multi-SNV gene for allelic imbalance between the two samples in a manner analogous to that in the 1-sample case. The only difference between the two cases is the origin and interpretation of (approximately) normal variables Z hap1,SN V that are combined. Consider the following SNV-level estimate of D: where for s ∈ {sample1, sample2}. This estimate is simply a difference in proportions of haplotype 1 allele-supporting reads between the two samples at that SNV, and we will call Z hap1,SN V the PD-value (for 'Proporition Difference') and denote its observed value by z hap1,SN V . We further denote the observed proportions of haplotype 1 allele-supporting reads by p hap1,SN V,sample1 and p hap1,SN V,sample2 . Note the difference between the observed proportion of haplotype 1 allelesupporting reads p hap1,SN V,s and the true (unkown) underlying frequency of haplotype 1 transcripts p hap1,g,s .
Under the assumptions of binomial model (60), we have and Furthermore, Z hap1,SN V is an approximately normal random variable with and Since the true values p hap1,g,sample1 and p hap1,g,sample2 are unknown, we estimate the variance of Z hap1,SN V as is the attenuated observed proportion of haplotype 1 allele-supporting reads. Note that the variance estimate (67) results from adding a pseudocount of 0.5 to each allele in both samples. This is done to avoid the situations where the estimated variance is 0 (e.g., if observed haplotype 1 proportions in both samples are 1). We do not attenuate the observed proportions when calculating z hap1,SN V in equation (61). Following the meta-analysis procedure described in section 2.2.2, a gene level PD-value z hap1,g is computed as the weighted mean of the SNV-level PD-values z hap1,SN V , with the weights given by the inverse of the estimated variances of Z hap1,SN V 's and normalized to add up to 1. We let Z hap1,g denote the random variable corresponding to the gene-level PD-value. The observed genelevel PD-value z hap1,g is the estimate of the haplotype 1 allele frequency difference between the two samples, Analogously to the 1-sample case, DerSimonian-Laird constancy statistic Q is calculated and the corresponding heterogeneity (inter-loci variability) p-value is recorded, with small p-values indicating evidence for isoform-specific between-sample allelic imbalance or other source of excessive variability. Same considerations as in 1-sample case apply to interpretation of heterogeneity p-values and to the power of statistical tests based on Q. We note that unlike the 1-sample approach presented earlier, the SNV-specific measures of ASE used in the 2-sample algorithm are not variance-stabilized. This leads to a certain bias in the calculation of z hap1,g , as some extreme values of z hap1,SN V , e.g., those close to 1, corresponding to p hap1,SN V,sample1 ≈ 1 and p hap1,SN V,sample2 ≈ 0 will be assigned greater weight due to the small values of their estimated variances (given in (66)). While variance-stabilizing transformations have been proposed for differences of binomial proportions [27], their variance-stabilizing properties only hold when the true value of D is close to that specified by a null hypothesis of interest (e.g., D = 0). We could have adopted this approach for the purpose of obtaining a measure of statistical significance of observed deviations from the null, but there is no straightforward way to simultaneously obtain the estimate of the observed ASE that is not computationally determined by the null hypothesis. We plan to address this issue in a future release of MBASED. When phasing information is unavailable, MBASED modifies the meta-analysis approach described above in the 2-sample setting as follows. As a first step, SNV-level counts within each gene are phased into haplotypes in the manner identical to that of the 1-sample analysis, with one caveat: phasing of both sample 1 and sample 2 is based solely on allelic counts from sample 1. In other words, all alleles presenting as major in sample 1 are assigned to the same haplotype (which we denote as 'haplotype 1' by analogy with 1-sample case, see section 2.2.2). This introduces a certain asymmetry into the analysis, but ensures that both samples have the same pair of haplotypes. The choice of sample 1 instead of sample 2 for this phasing step is motivated by the fact that it is the ASE in sample 1 which is of interest, and that for genes showing true sample 1-specific ASE, the allele counts in sample 1 are more informative for haplotype reconstruction. Additionally, 'voting'based pseudo-phasing of SNVs into haplotypes based on both samples could be highly misleading in cases where the direction of within-sample ASE changes between the two samples. For example, if the ratio of the two haplotypes is 0.7/0.3 in the normal sample and 0.3/0.7 in the tumor sample, then the 'voting'-based pseudo-phasing of SNVs into 'major' and 'minor' haplotypes based on read counts in both samples is likely to result in chimeric haplotypes, when phasing based on any one single sample would likely recover the true underlying haplotypes.
After 'voting'-based pseudo-phasing is performed, we obtain a gene-level estimate of the betweensample allelic imbalance T P D and the heterogeneity statistic Q (with corresponding p-value), as described above. We then estimate the null distribution of T P D for each gene using simulations, analogous to 1-sample case, and assign a final ASE-pvalue p ASE as follows.
Recall that under the null hypothesis of no sample-specific ASE (58), we have p hap1,g,sample1 = p hap1,g,sample2 = p hap1,g where hap1 and hap2 refer to true underlying haplotypes (and not to haplotypes 1 and 2 obtained based on the 'voting' pseudo-phasing of sample 1 allele read counts). Also, under the null hypothesis, let us denote p maj,g = max{p hap1,g , p hap2,g } In each round of simulations, one allele at each SNV is randomly assigned to major haplotype, and the simulated major haplotype counts x sim,maj,SN V,sample1 and x sim,maj,SN V,sample2 are drawn independently from Bin(n = n SN V,sample1 , p = p maj,g ) and Bin(n = n SN V,sample2 , p = p maj,g ) distributions, respectively (we address estimation of p maj,g in section 2.3.2). The SNVs are then phased into haplotypes 1 and 2 based on the 'voting' algorithm applied to the simulated counts in sample 1, and the proportion difference T sim,P D is calculated. The final ASE p-value is derived as where N sim is the number of simulations. Note that the empirical significance test (74) is one-sided.
The reason for this is as follows. We expect that in cases of true strong sample 1-specific ASE, our 'voting' pseudo-phasing procedure will recover true underlying haplotypes, with haplotype 1 being the major of the two. This in turn results in p hap1,g,sample1 = p maj,g,sample1 in equation (57), producing a large positive value of D = T P D , unless the ASE is not specific to sample 1 and p maj,g,sample2 is also large.
In principle, the simulations can also be employed when dealing with paired samples where the true haplotypes are known, just as in 1-sample analysis. Finally, we note that the heterogeneity p-value is calculated based on simulations, in a manner identical to the 1-sample analysis (equation 25

Estimation of p maj,g
There is an additional layer of complexity involved in the 2-sample analysis. The null hypothesis of no sample-specific ASE (59) does not specify the exact value of p hap1,g . If true phasing is known, we can obtain maximum likelihood estimate p hap1,g under the null as follows. For a given value of p hap1,g the likelihood of the observed data at an individual SNV under the null hypothesis (59) and data-generating binomial model (60) is where p hap2,g = 1 − p hap1,g and The joint likelihood across all SNVs in gene g is given by with p hap1,g being the value of p hap1,g which maximizes the joint likelihood (78). We can then obtain estimate p maj,g = max{ p hap1,g , 1 − p hap1,g } When phasing is unknown, this procedure needs to be modified. One solution would be to use the haplotypes produced by MBASED using 'voting'-based pseudo-phasing approach on sample 1 read counts, since these are the haplotypes used to obtain D = T P D (see above). However, this approach tends to consistently overestimate p maj,g under the null hypothesis for the following reason. Our pseudo-phasing algorithm results in haplotype 1 such that ≥ 0.5 at each SNV, producing strong evidence for p haplotype1,g > 0.5. Even when information from the second sample is taken into account in the likelihood expression (75), the end result of maximizing the likelihood, in general, will be identification of haplotype 1 as 'major'. While this poses no problem when haplotype 1 does in fact correspond to the major haplotype of the gene (as is likely to be the case when sample 1 exhibits strong ASE, whether sample-specific or not), it will artificially inflate p maj,g = p haplotype1,g otherwise. Therefore, we adopt a different strategy and estimate p maj,g under the null hypothesis by maximizing over all possible assignments of SNV-level alleles to gene haplotypes. In other words, if a gene contains 3 heterozygous SNVs, we maximize likelihood over all 2 3 = 8 possible haplotype reconstructions. We describe this procedure in detail below. For clarity of notation, we denote two arbitrary haplotypes of a gene as hapA and hapB, to distinguish them from previously used hap1 and hap2.
We re-write the likelihood (75) of the observed data at an invididual SNV under the null hypothesis for a given haplotype phasing into hapA and hapB as where the indicator variable I SN V,ref =hapA ∈ {0, 1} is used to denote whether the reference allele belongs to hapA (1) or hapB (0), and p hapB,g = 1 − p hapA,g . Thus, for each of the 2 possible haplotype assignments of the reference allele, only 1 of the 2 summands in expression (80) is nonzero.
The joint likelihood across all SNVs within a particular gene can be written as For each choice of haplotype assignments hapA and hapB, we let p hapA,g be the value of p hapA,g that maximizes the joint likelihood (81), with the corresponding likelihood value L( p hapA,g |hapA).
Finally, among all haplotype assignment-specific likelihood values L( p hapA,g |hapA), we choose the maximum one, and denote the corresponding p hapA,g by p hapA * ,g . We then let p maj,g = max{ p hapA * ,g , 1 − p hapA * ,g } We note that the obtained estimate is biased in cases where true p maj,g ≈ 0.5, and this may lead to some anti-conservative p-value calls on the part of MBASED. In particular, we may observe enrichment for low p-values if the estimated widehatp maj,g is sufficiently greater than p maj,g = 0.5. We found, however, that we do not observe such enrichment under the levels of dispersion seen in our data, although we do see occasional instances of this if we assume binomial (no dispersion) model (data not shown). This limitation should be addressed in future release of MBASED.

Single-SNV genes
The 2-sample analysis for single-SNV genes is identical to that for multi-SNV genes (including simulations), except that the gene-level PD-value T P D = z hap1,g is the SNV-level PD-value z hap1,SN V .

Adjustment for pre-existing allelic bias
We extend model (60) to include situations of pre-existing allelic bias by letting for s ∈ {sample1, sample2}, where we allow f hap1,SN V,s (p hap1,g,s ) = p hap1,g,s for at least one sample. We still assume functional form (6) for f hap1,SN V,s (p hap1,g,s ). Note that we do not assume that pre-existing allelic bias is the same for both samples, i.e., we allow Under the assumptions of model (83), we obtain from (62) that and It follows that Comparing to equation (57), we note that in general unless there is no pre-existing allelic bias (f hap1,SN V,s (p hap1,g,s ) = p hap1,g,s ) for both samples. Of course, we might still have E(Z hap1,SN V ) = D for some specially chosen values of p hap1,g,s and f hap1,SN V,s (0.5). We need to modify our procedure for combining estimates of ASE at individual SNVs into a gene-level estimate to account for the fact that E(Z hap1,SN V ) may no longer be constant across SNVs within the same gene and is no longer equal to D. Consider a transformation based on our approach to handling pre-existing allelic bias in the 1-sample analysis (section 2.2.4): where s ∈ {sample1, sample2}. In words, we use Freeman-Tukey transformation on observed haplotype 1 allele-supporting counts in each sample and adjust the resulting FT-value in the same manner as we did in 1-sample case to obtain Y hap1,SN V,s . We then backtransform the shifted FT-value into a shifted proportion (P hap1,SN V,s,adj ) and let Z hap1,SN V,adj be the difference be-tween the shifted proportions in samples 1 and 2. We denote the observed values of random variables P hap1,SN V,sample1,adj , P hap1,SN V,sample2,adj , and Z hap1,SN V,adj by p hap1,SN V,sample1,adj , p hap1,SN V,sample2,adj , and z hap1,SN V,adj , respectively. Note, that transformation (90) yields previously described model (61) in the case of no pre-existing allelic bias (f hap1,SN V (0.5) = 0.5), since in such case we have Y hap1,SN V,s = F T (X hap1,SN V,s , n SN V,s ) The goal of transformation (90) is to produce shifted proportion P hap1,SN V,s,adj , such that We find that approximation (92) holds for a wide range of values of p hap1,g,s and f hap1,SN V,s (0.5), with slightly worse performance for extreme values of pre-existing allelic bias ( Figure S17).
To address the issue of variance of P hap1,SN V,s,adj , we suggest the following approximation: We found this approximation to perform well in general, and especially so for mild values of preexisting allelic bias (0.4 ≤ f hap1,SN V,s (0.5) ≤ 0.6) and for moderate-to-high coverages (n SN V,s ≥ 20) ( Figure S18). We plan to address any remaining inadequacies in a future release of MBASED. Based on approximations (92) (96) 4. Estimating p maj,g : Based on the binomial distribution model (83), we re-write the binomial likelihood expression (75) at a single SNV as: where f hap1,SN V,s (p hap1,g ) is as defined in (6) and is uniquely determined by unknown p hap1,g and f hap1,SN V,s (0.5) (assumed to be known or previously estimated by MBASED). This modification is then used in the procedure described in section 2.3.2. distribution.
Just as with the one-sample algorithm, MBASED requires that values of f hap1,SN V,s (0.5) be supplied for each sample and each SNV. In our study we used sample-specific estimates obtained as described in section 2.2.5.

Adjustment for technical overdispersion
We now extend binomial model (83) to full model (56) used by MBASED, where we allow that for at least one sample s ∈ {sample1, sample2}.
Following the same reasoning as in section 2.2.6, it is only necessary to consider the effect of overdispersion on the mean and variance of Z hap1,SN V,adj . From the properties of beta-binomial distribution in section 2.1.2, for s ∈ {sample1, sample2} we have and var(P hap1,SN V,s ) = f hap1,SN V,s (p hap1,g,s ) × (1 − f hap1,SN V,s (p hap1,g,s )) Note that under conditions of no overdispersion (ρ SN V,s = 0), expression (104) reduces to which is identical to previously derived expression (87) for the model with no overdispersion. Based on the generative beta-binomial model (56) and shifting transformation (90), MBASED relies on the following approximations to take technical overdispersion into account: and Approximation (106) is identical to approximation (92) made in section 2.3.4, and we find that its accuracy is virtually unaffected by the choice of overdispersion parameter ρ SN V,s even for very large values of overdispersion (ρ SN V,s = 0.1).
Approximation (107) is analogous to approximation (93) made in section 2.3.4, and we find that it performs very well for moderate values of overdispersion ρ SN V,s ≤ 0.02 and for moderate-to-high coverage levels (nSN V, s ≥ 20, data not shown). The performance is especially good for mild levels of pre-existing allelic bias (0.4 ≤ f hap1,SN V,s (p hap1,g,s ) ≤ 0.6).
Based on approximations (106) and (107), we conclude that under full model (56), we have and var(Z hap1,SN V,adj ) ≈ E(P hap1,SN V,sample1,adj ) × (1 − E(P hap1,SN V,sample1,adj )) Analogously to the approach we took in section 2.3.4, we estimate variance of Z hap1,SN V,adj as where for s ∈ {sample1, sample2} p hap1,SN V,sample1,adj = N I(p hap1,SN V,s,adj × n SN V,s ) + 0.5 n SN V,s + 1 (111) with N I denoting the nearest integer function. Note that we assume that values of ρ SN V,s for both samples are supplied to MBASED (see below). The adjustment for technical overdispersion is incorporated into a 2-sample MBASED analysis in the following ways, which parallel the adjustments made in 1-sample case (section 2.2.6): 1. Meta-analysis: We use new variance estimates (110) when calculating weights used to derive z hap1,g 2. Estimating p maj,g : Based on distribution model (56) and properties of beta-binomial distribution, we re-write the binomial likelihood expression (99) at a single SNV as beta-binomial likelihood: where and Since likelihood (112) is undefined when ρ SN V,s = 0, we use binomial likelihood expression (99) in such cases. Note that f hap1,SN V,s (p hap1,g ) is as defined in (6) and is uniquely determined by unknown p hap1,g and f hap1,SN V,s (0.5) (assumed to be known or previously estimated by MBASED). Furthermore, ρ SN V,s is also assumed to be know, leaving p hap1,g as the only variable in the likelihood expression (112). This modified likelihood is then used in the procedure described in section 2.3.2 to obtain p maj,g . distribution.
Just as with the one-sample algorithm, MBASED requires that values of ρ SN V,s be supplied for each sample and each SNV. In our study we used sample-specific estimates obtained as described in section 2.2.7.

MBASED Performance in Absence of Phasing Information
A natural question to ask is how well MBASED performs when the true haplotypes are unknown, compared to the situation when they are known. We attempted to answer this question by comparing the outcomes of running MBASED with and without provided haplotypes on a data set where the true haplotypes are known. The issue of whether the performance of the 'phased' version of MBASED is sufficient to be used as a 'golden standard' in the comparison we perform here will be addressed in the next section. We obtained previously published LCL RNA-Seq data and a list of phased genomic variants for (female, non-cancer) individual NA12878, genotyped together with both parents as part of 1000 Genomes Project [7]. The RNA-Seq data was downloaded from GEO using accession number GSE30401. The list of variants, together with the phasing information was downloaded from http://archive.gersteinlab.org/proj/AlleleSeq/. Only the phased heterozygous variants were used for the analysis. We pre-processed the data analogously to other samples in our panel, with exception of filtering based on observed variant allele read frequency at DNA level. This last filter was not preformed as it was not clear which WGS data set available for this individual should be used. Instead, we relied on the variant filtering pipeline employed by the 1000 Genomes Project, and assumed that any remaining heterozygous variants were genuine. We only aligned RNA-Seq data sets corresponding to paired end sequencing results, and pooled all of these runs together, as they appeared to be technical replicates sequenced on the same Illumina flow cell.
We ran MBASED on this data set both with ('phased') and without ('non-phased') specifying the true haplotypes. Overall, we tested 2,560 genes for ASE, including 1,104 (40%) with > 1 heterozygous loci. Using our usual cutoffs (MAF ≥ 0.7, adjusted p-value ≤ 0.05), MBASED found 110 genes to show ASE in the 'phased' setting and 115 genes in the 'non-phased' setting, of which 108 were in common, indicating high degree of consistency. Overall, we observed ASE in ∼ 4.5% of the tested genes, with 45% of ASE genes found on chromosome X, due to chromosome X inactivation in this female individual.
Conceptually, knowing the true haplotypes provides two advantages in ASE detection. First, it increases power to detect ASE. Consider a situation shown below: If we do not know the true haplotypes, then we have to assume that the pairs of counts above can be assigned to haplotypes as {26, 22}/{15, 4} or {26, 4}/{15, 22}. The former case would lead to higher and more significant estimate of gene-level ASE than the latter. Since neither possibility can be excluded, our confidence in ASE of this gene will be lower when we do not know true haplotypes than it would be if we did. This is reflected in MBASED algorithm, where the {26, 22}/{15, 4} and {26, 4}/{15, 22} pairs of haplotype counts produce the same MAF estimates, unless we supply true haplotype phasing information. Such situations may lead to ASE being detected by the 'phased' but not the 'non-phased' approach.
Another advantage provided by knowing the true haplotypes is the increased power to detect isoform-specific effects. Consider a situation shown below: If we do not know the true haplotypes, the MBASED MAF estimate for this gene will be based on {56, 41}/{5, 4} pair of haplotype counts, due to pseudo-phasing, likely resulting in a call of highly significant ASE. Importantly, the pseudo-phased allelic counts show high concordance between the two loci and the estimate of inter-loci variability will be low. Thus we will miss entirely the fact that there is considerable disagreement between the two SNVs within this gene with respect to which haplotype is overexpressed and by how much. On the other hand, if we know the true haplotypes, the MBASED estimate of MAF will be close to 0.5 and the estimate of inter-loci variability will be large, allowing us to detect the disagreement between the individual SNVs. Such behavior may arise, for example, if the two SNVs fall into two exons, used exclusively by two distinct transcript isoforms of the gene, with each transcript isoform subject to allele-specific expression. Such situations may lead to ASE being detected by the 'non-phased' but not the 'phased' approach (although note that the 'phased' approach will produce highly significant measure of inter-loci variability, which can be used to flag such situations for further investigation).
In our acutal data, the two instances where ASE is detected by the 'phased' approach only illustrate the increased power of ASE detection when true underlying haplotypes are known ( Table  1). The two gene are UBE3B and L3MBTL2 (the data for L3MBTL2 is shown in Table 1), and the adjusted p-values given by the 'phased' approach are 0.048 and 0.036, respectively, compared to p-values of 0.081 and 0.062, produced by the 'non-phased' approach (the MAF estimates are the same). Note that while the 'non-phased' approach did not declare these genes as showing ASE, the actual adjusted p-values are still close to our chosen cutoff of 0.05. We therefore conclude that at the coverage levels observed in this RNA-Seq data set (inter-quartile range of 17-59 reads/SNV), the additional power provided by knowing the true haplotypes results in very minor improvement in ASE calling.
Out of 7 cases of ASE detected by the 'non-phased' approach only, 3 are due to small differences in adjusted p-values at the margins of our cutoff (e.g, 0.049 vs. 0.0051, Figure S7). The remaining 4 cases represent situations where allelic counts point towards over-expression of different haplotypes at individual loci within a gene (example in Table 2). Of these 4 instances, we find that 3 are likely the artifacts of alignment procedure: 1. RAB5A: There are 2 heterozygous SNVs, with haplotype-specific read counts of {48, 0}/{19, 11}.
However, the second (homozygous) SNV is inside the 3' UTR and is within 150bp of an exon of another gene ( Figure S19).  Figure S21).
The remaining case is that of gene MTFP1. There are 2 heterozygous SNVs, with haplotypespecific read counts of {11, 51}/{12, 18}. Here, there seems to be genuine difference between the first SNV (no ASE) and the second SNV (overexpression of first haplotype). The second SNV is located within 3' UTR of the gene, and there are 2 distinct isoforms of MTFP1, suggesting that true isoform-specific ASE may be taking place ( Figure S22). However, it should be noted that the estimates of MAF and statistical significance are very similar for both the 'phased' and 'non-phased' approach for this gene ( Figure S7). The 'phased' approach estimates MAF as 0.68 (and does not pass our cutoff), while the 'non-phased' approach resolves the two haplotypes to consist of read count pairs {12, 51}/{11, 18}, resulting in MAF estimate of 0.70 (passing the cutoff). In addition to the cases of genes declared ASE by the 'phased' or 'non-phased' approaches exclusively, we have also investigated the instances of genes which fail to be correctly resolved into the true underlying haplotypes by the 'non-phased' approach. Such genes may or may not be declared ASE by both approaches. We find that we successfully reconstruct haplotypes of 40/47 (85%) ASE ('non-phased' approach) genes with > 1 heterozygous loci (Additional File 3, Data Set S1), including 18/18 genes on chromosome X, where the signal is the strongest (due to chromosome X inactivation in this female individual). Of the 7 genes where haplotype reconstruction fails, 6 are likely artifacts of alignment procedure, including RAB5A, PPIA, and SPN, discussed above. Another 3 genes (PPM1K, PM20D2, and APOL2) exhibit high levels of expression in intronic regions, combined with the presence of all heterozygous variants in UTR regions (data not shown), similar to the case of PPIA gene. The remaining one instance is the gene MTFP1, discussed above.
We conclude that for the testing data set, the improvements in performance of MBASED obtained by having access to true gene haplotypes are very minor. We further conclude that MBASED exhibits extremely high recovery rate of true underlying haplotypes of ASE genes.

Comparison of MBASED to the Method of Skelly et al.
To assess how well MBASED performs when the true haplotypes are known, we compared its performance to the method of Skelly et al. [23] ('Skelly Method'). Skelly method is the only currently published ASE detection approach that allows for variable ASE within a gene. It adopts a Bayesian approach and estimates the gene-level posterior distribution of major allele frequency, as well as the posterior probability of ASE. We adopted the suggestions from the Skelly method tutorial on setting the parameters for MCMC runs. Since we lacked DNA data to estimate prior parameters for Skelly method, we adopted approach identical to Skelly et al. [23], who, when faced with the same predicament, employed the prior parameters from their yeast data analysis. We thus setâ = 3600 andd = 550. Several runs were performed, yielding very similar results. Only one set of results is discussed here, for clarity of interpretation (Additional File 3, Data Set S1).
To define ASE genes from the output of Skelly method, we required that the posterior probability of ASE be at least 0.95 and that the median of posterior distribution of haplotype 1 frequency was ≥ 0.7 (haplotype 1 is major) or ≤ 0.3 (haplotype 2 is major). This yielded 103 ASE genes, compared to 110 derived by the MBASED 'phased' approach. Of these, 94 genes were common to both methods. Table ?? shows 9 genes, found to exhibit ASE by Skelly method only. We observed that in all such cases, the SNV coverage was low, and while MBASED produced high estimates of MAF, the adjusted p-values fell slightly short of the chosen cutoff of 0.05 (note that the p-values and MAF estimates reflect adjustment for global reference bias, possibly aggregated across multiple SNVs within the same locus, explaining, e.g, difference between reported values for ACSL4 (2 SNVs merged into a single locus, alternative allele is major for both), APOLD1 (1 SNV, alternative allele is major), and HSPA13 (1 SNV, reference allele is major)).
We further investigated 16 genes which were declared as showing ASE by MBASED, but not by Skelly Method. We find that 15 of these have posterior probability of ASE equal to 1, but fail to pass our imposed major allele frequency cutoff (Skelly Method MAF 0.58-0.69). This is general feature of Skelly Method, and ineed of the Bayesian framework they adopt. Since the prior distribution heavily favors the setting of no ASE, this is reflected in the posterior distribution of major allele frequencies shifted towards 0.5 and away from the observed frequencies.
The remaining instance is that of gene UBE3B, which has Skelly Method posterior probability of ASE equal to 0.001 and Skelly Method MAF of 0.51, compared to MBASED adjusted ASE p-value of 0.048 and estimated MAF of 0.78. The observed difference is due to the preprocessing step we took when incorporating pre-existing allelic bias into probabilities of observing reads corresponding to individual haplotypes under the conditions of no ASE. For example, since the estimated pre-existing probability of observing reference allele is 0.52 for this data set, we assume that under conditions of no ASE, we would observe reference allele 52% of the time. The 3 heterozygous SNVs of gene UBE3B were resolved into a single locus by overlapping reads, leading to two haplotypes: one consisting of all alternative alleles and another consisting of all reference alleles. We therefore aggregated the pre-existing allelic biases across the SNVs to yield the no-ASE probability of observing the allreference haplotype of 0.56, very different from 0.5 assumed by the Skelly Method. Since, in fact, it is the all-alternative haplotype of the UBE3B gene that shows overexpression, MBASED ends up assigning very high significance to the observed ASE, and the adjustment for pre-existing allelic bias shifts the estimated MAF away from 0.5. We note that this observed difference between MBASED and the Skelly Method is not due to an algorithmic difference but to the additional pre-processing employed by us.
Overall, we conclude that the performance of MBASED is qualitatively similar to that of Skelly Method, with main difference being that the Bayesian framework of Skelly Method produces posterior distribution of haplotype frequencies shifted towards 0.5 relative to the observed frequencies, used by MBASED.  Figure S1. True positive rate (TPR) and false discovery rate (FDR) for 1-sample simulation results in absence of overdispersion (ρ = 0). Note that MBASED performance improves (higher TPR at the same nominally controlled FDR of 5%), especially at low signal (major allele frequency (MAF) = 0.7) as overdispersion decreases (cf. Figure 2, where we used value ρ = 0.004, estimated from real data (Supplementary Methods)). SE: estimated standard error. Figure S2. True positive rate (TPR) and false discovery rate (FDR) for 1-sample simulation results at high levels of overdispersion (ρ = 0.02). Note that MBASED performances worsens (lower TPR at the same nominally controlled FDR of 5%), especially at low signal (major allele frequency (MAF) = 0.7) as overdispersion increases (cf. Figure 2, where where we used value ρ = 0.004, estimated from real data (Supplementary Methods)). Figure S3. True positive rate (TPR) and false discovery rate (FDR) for 2-sample simulation results in absence of overdispersion (ρ = 0). Note that MBASED performance improves (higher TPR at the same nominally controlled FDR of 5%) as overdispersion decreases (cf. Figure 3, where where we used value ρ = 0.004, estimated from real data (Supplementary Methods)). MAF: major allele frequency. SE: estimated standard error. Figure S4. True positive rate (TPR) and false discovery rate (FDR) for 2-sample simulation results at high levels of overdispersion (ρ = 0.02). Note that MBASED performance worsens (lower TPR at the same nominally controlled FDR of 5%), especially at low signal (major allele frequency (MAF) = 0.7) as overdispersion increases (cf. Figure 3, where where we used value ρ = 0.004, estimated from real data (Supplementary Methods)). SE: estimated standard error. Genes called ASE by 'phased' approach only are shown in red. Genes called ASE by 'non-phased' approach only are shown in blue. Figure S5. True positive rate (TPR) and false discovery rate (FDR) for 1-sample simulation results in presence of pre-existing allelic bias (P ref = 0.6). Note that MBASED performs virtually the same as in the setting of no bias (cf.    Figure S8. Chromosome 12-specific genes with recurrent ASE in cancer tissue samples (cf. Figure  6A). X-axis correspond to the entire chromosome. Y-axis represents copy number (CN) state as l2r (log2 tumor/normal ratio). Horizontal segments show CN state along chromosome (red: detected allelic imbalance (AI); black: no detected AI). Green lines on top and bottom show globally estimated cutoffs for CN gain and CN loss, respectively (in some instances the global cutoffs do not fit chromosome-specific data well). Genes with recurrently detected ASE are plotted next to the segments in which they are located (connected with vertical dotted lines): black: gene not tested for ASE in this sample; red: gene found to show ASE in this sample; blue: gene found to not show ASE in this sample. The rug along the x-axis shows the location of all tested genes found to show ASE (red) and not to show ASE (blue). Recurrent CN alterations involving KRAS leads to recurrent ASE of nearby genes. Figure S9. Chromosome 17-specific genes with recurrent ASE in cancer tissue samples (cf. Figure  6A and Supplementary Figure 8). In this particular case, the recurrently deleted genomic fragment contains TP53, a known tumor suppressor. This recurrent deletion leads to recurrent ASE of nearby genes (likely due to heterozygous calls resulting from normal admixture). Figure S11. Percent of tested genes on chromosome X found to show ASE in female samples. In all cases (except H2009, which lost a copy of chromosome X), this percent is higher than the overall ASE rate, likely driven by chromosome X inactivation events. The percent is also much higher in cancer than in normal samples. Figure S12. Observed reference bias. For each sample, we take all heterozygous exonic SNVs, tally the number of RNA-Seq reads across all SNVs that support reference allele and divide this number by the total number of reads covering such SNVs (top 5% of SNVs by read count are excluded from the calculation). The resulting fraction is multiplied by 100 and plotted. A small but consistent preference for reference allele is observed.  show DNA-Seq alignments for the H1993 cell line from different platforms. There appear to be 2 distinct haplotypes: reference and alternative, with the latter haplotype containing all of the variants (except for the single variant appearing in RNA-Seq data). It is likely that a non-expressed homologous region absent from the reference genome is responsible for these observations. The tracks for other studied samples are similar.