Unit of AE data
The biological signal of interest in AE analysis is the relative expression of a given transcript from the two parental chromosomes. Typical AE data seek to capture this by counts of RNA-seq reads carrying reference and alternative alleles over heterozygous sites in an individual [heterozygous single-nucleotide polymorphisms (het-SNPs)], and this is the focus of our analysis unless mentioned otherwise. The Geuvadis samples with a median depth of 55 million mapped reads have about 5000 het-SNPs covered by ≥30 RNA-seq reads, distributed across about 3000 genes and 4000 exons (Fig. 2; Additional file 2). The exact number varies due to differences in sequencing depth, its distribution across genes, and individual DNA heterozygosity. About one half of these genes contain multiple het-SNPs per individual, which could be aggregated to better detect AE across the gene (Fig. 2d). However, alternative splicing can introduce true biological variation in AE in different exons, and incorrect phasing needs to be accounted for in downstream analysis [13]. Additionally, summing up data from multiple SNPs is not appropriate if the same RNA-seq reads overlap both sites. In the Geuvadis data, 9 % of the reads used in AE analysis in fact overlap more than one het-SNP (Figure S2d in Additional file 2), but this will become more frequent as read lengths increase [21]. In the future, better tools are needed to partition RNA-seq reads to either of the two haplotypes according to all het-SNPs that they overlap [22]. In fact, this could help to phase exonic sites separated by long introns.
AE analysis of small insertions or deletions (indels) has proven to be technically very challenging and it is rarely attempted even though frameshift indels are an important class of protein-truncating variant. Alignment errors over indel loci are pervasive due to multiple mismatches of reads carrying alternative alleles, and lower genotyping quality adds further error [12]. In Rivas et al. [12] we describe the first approach for large-scale analysis of AE over indels, but further methods development is warranted for better sensitivity and computational scalability.
In addition to classical AE analysis to detect differences in total expression level of two haplotypes, it is also possible to analyze allelic differences in transcript structure or splicing [allelic splicing (AS)] [5, 21]. These methods compare the exon distribution of reads and their mates carrying different alleles of a heterozygous site, and work increasingly well for longer total fragment lengths. In these analyses, the data structure is somewhat more complex than reference/non-reference read counts in AE, depending on the specific algorithm. While this paper focuses on classical AE analysis of SNPs, most of the quality analysis steps apply to indel AE and AS analyses as well.
Tools to retrieve allele counts
Allele counts are the starting point for all AE analyses, and many previous tools can retrieve these counts. However, they also perform other analyses that require additional input data and increase the runtime. Here we present simple tools that can be used to retrieve only allele counts, using the minimum required inputs in standard formats. We present two solutions: 1) a highly efficient Python tool that processes results from SAMtools mpileup, the framework used by the majority of existing AE analysis pipelines; and 2) an easy to use tool in the widely used GATK v.3.4 [23, 24] called ASEReadCounter, which does not require any additional setup, and includes a variety of easily customizable read processing options as well as professional maintenance and documentation, similar to other GATK tools. Both operate on aligned RNA-seq reads and count the reference and alternative allele reads that passed filters for mapping and base quality at each bi-allelic heterozygous variant. The GATK tool offers several additional options for processing RNA-seq reads: by default each read fragment is counted only once if the base calls are consistent at the site of interest, and duplicate reads are filtered (see below). Other options allow filtering for coverage and for sites or reads with deletions. The output of both is one file per RNA-seq input file, with one line per site displaying the counts for each allele as well as counts of filtered reads, and can be used for downstream analyses. The tools yield consistent results, with runtimes comparable to a previously published tool [25] (Additional file 3).
Quality control of allele counting
Retrieving allele counts from RNA-seq data over a list of heterozygous sites is conceptually very simple, but several non-trivial filtering steps need to be undertaken to ensure that only high-quality reads representing independent RNA/cDNA molecules are counted. The first commonly applied filter is to remove reads with a potentially erroneous base over the heterozygous site based on low base quality. Furthermore, potential overlap of mates in paired-end RNA-seq data needs to be accounted for, so that each fragment, representing one RNA molecule, is counted only once per het-SNP. In the Geuvadis data, an average of 4.4 % of reads mapping to het-SNPs per sample are derived from overlapping mates, but this number will vary by the insert size (Figure S4a in Additional file 4).
In RNA-seq analysis, duplicate reads with identical start and end positions are common (15 % of reads in Geuvadis AE analysis), because highly expressed genes get saturated with reads (Figure S4b, d in Additional file 4). Thus, by default, duplicates are usually not removed from RNA-seq data to avoid underestimating expression levels in highly expressed genes [5]. However, we observe consistent albeit infrequent signs of PCR artifacts in the Geuvadis AE data, especially affecting lowly covered sites — where duplicates are mostly true PCR duplicates, since saturation is unlikely. Removing duplicate reads reduces technical sources of AE at these sites, while having a minimal effect on highly covered, read-saturated SNPs (Figure S4e in Additional file 4). Thus, we suggest that removing duplicate reads is a good default approach for AE analysis, and it is implemented as a default in the GATK tool. However, it is important that the retained read is either chosen randomly or by base quality, and not by mapping score, so as not to bias towards the reference allele.
The most difficult problem in AE analysis and a potential source of false positive AE is ensuring that 1) all the reads counted over a site indeed originate from that genomic locus, and 2) all reads from that locus are counted. RNA-seq studies with shorter or single-end RNA-seq reads are more susceptible to these problems. First, to make sure that no alien reads get erroneously assigned to a locus, only uniquely mapping reads should be used. This implies that highly homologous loci — such as microRNAs — are not amenable to AE analysis.
An even more difficult caveat in AE analysis is allelic mapping bias: in RNA-seq data aligned to the reference genome, a read carrying the alternative allele of a variant has at least one mismatch, and thus has a lower probability to align correctly than the reference reads [26–28]. Simulated data in Panousis et al. [27] indicates substantial variation between sites — in most cases reads mapped correctly, but 12 % of SNPs and 46 % of indels had allele ratio bias >5 % with some having a full loss of mapping of the alternative allele. Loci with homology elsewhere in the genome are particularly problematic as reads have nearly equally good alternative loci to align to. Furthermore, even a site with no bias in itself can become biased due to a flanking (sometimes unknown) variant that shares overlapping reads with the site of interest. In addition, mapping bias varies depending on the specific alignment software used (Additional file 5).
Various strategies can be employed to control for the effect of mapping bias on AE analysis. The simplest approach that can be applied to AE data without realignment is to filter sites with likely bias [5, 8, 28]. In previous work [5, 8, 29–31] and in this paper, unless mentioned otherwise, we remove about 20 % of het-SNPs that either fall within regions of low mappability (ENCODE 50 bp mappability score < 1) or show mapping bias in simulations [27]. This reduces the number of sites with strong bias by about 50 % (Fig. 3b) but the genome-wide reference ratio remaining slightly above 0.5 indicates residual bias (Figure S6a in Additional file 6). Using this ratio as a null in statistical tests instead of 0.5 [5, 6] can improve results (Figure S6b–e in Additional file 6). More exhaustive but computationally intensive approaches include alignment to personalized genomes [18, 32, 33], or use of a variant-aware aligner, such as GSNAP [34]. These methods yield comparable results and eliminate average genome-wide bias (Fig. 3a; Additional file 5), but the fact that applying a mappability filter still removes monoallelic sites implies that not all bias is eliminated (Fig. 3b). In particular, in personalized or variant-aware approaches sites with homology elsewhere in the genome can have very substantial allelic mapping bias towards either the reference or non-reference allele, which occurs when reads carrying one allele map perfectly and reads with the other allele align to multiple loci. A novel approach is the specific removal of reads that show mapping bias with software such as WASP [35], which generally performs well, although some signs of residual bias still remain. Additional file 7 presents a summary of the strengths and weaknesses of each strategy. Altogether, while many approaches yield reasonably accurate data, allelic mapping bias remains a problem that cannot be perfectly eliminated with available solutions.
Quality control of genotype data
AE analysis relies on data of heterozygous sites to distinguish the two parental alleles. These genotype data are ideally retrieved from DNA-sequencing or genotyping arrays, but the RNA-seq data themselves can also be used for calling genetic variants and finding heterozygous sites [36–39]. However, true allelic imbalance can lead to heterozygous sites being called homozygous in RNA-based genotype calling and lead to substantial error in monoallelic genes due to, e.g., imprinting, and more subtle bias in expression quantitative trait loci (eQTL) genes (Figure S7a in Additional file 8).
Even when using heterozygous genotypes called from DNA data, genotyping error can be an important source of false signals of allelic imbalance, because AE data from a homozygous site appear as monoallelically expressed. In genotype data that has passed normal quality control (QC), including Hardy-Weinberg equilibrium test, genotype error will lead to rare cases of monoallelic expression per site, not shared across many individuals (Fig. 1b). False heterozygous genotype calls are rare but not negligible in AE analysis using SNP genotypes from arrays or from modern sequencing data, but much more common in imputed data (Fig. 4a). Calculating the genome-wide proportion of monoallelic AE sites per individual is a sensitive method for genotyping quality control (Fig. 4a, arrowheads).
Removing genotyping error is relatively straightforward for analysis of moderate allelic imbalance (such as that caused by cis-regulatory variants): removing monoallelic variants removes sites with false genotypes and results in little loss of truly interesting data. However, highly covered sites are rarely strictly monoallelic even in a homozygous state due to rare errors in sequencing and alignment (Figure S7b in Additional file 8). Thus, we propose a genotype error filter where the average amount of such sequencing noise per sample is first estimated from alleles other than reference (REF) or alternative (ALT) (Figure S7c in Additional file 8). Then, binomial testing is used to estimate if the counts of REF/ALT alleles are significantly higher than this noise, and sites where homozygosity cannot be thus rejected are flagged as possible errors (Fig. 4b). Additionally, it may be desirable to flag fully monoallelic sites with low total counts, where homozygosity cannot be significantly rejected, but heterozygosity is not supported either. This test can also be applied to study designs with RNA-seq data from multiple samples (e.g., tissues or treatments) of a given individual, genotyped only once, since genotyping error causes consistent monoallelic expression in every tissue. In the Geuvadis data set with 1000 Genomes phase 1 genotypes and sites covered by eight or more reads, an average of 4.3 % of sites per sample are excluded by these criteria [1 % false discovery rate (FDR)].
Unfortunately, genotyping error is very difficult to distinguish from a true biological pattern of strong monoallelic expression, shared across all studied tissues, and present in a small number of samples, such as analysis of nonsense-mediated decay triggered by a rare variant, or a rare severe regulatory mutation (Fig. 1). The only real solution is rigorous genotype quality control and/or validation, and taking the possibility of confounding by genotyping error into account in interpretation of the results.
Sample mislabeling or mixing of the RNA-seq samples can lead to a substantial number false positive hits — as opposed to reduction of power in eQTL studies. Fortunately, simple metrics from AE analysis provide a sensitive way to detect sample contamination and mislabeling [40]. DNA-RNA heterozygous concordance — i.e., the proportion of DNA-heterozygous sites that are heterozygous also in RNA data — and a measure of allelic imbalance detect outliers and indicate the type of error (Figure S7d in Additional file 8).
Technical covariates
RNA-seq has become a mature and highly reproducible technique, but it is not immune to technical covariates such as the laboratory which experiments were performed in, aspects of library construction and complexity, and sequencing metrics [40]. Gene expression studies are particularly susceptible to these technical factors, because read counts between samples are compared. AE analysis has the advantage that only read counts within samples are compared (allele versus allele), which makes it less susceptible to technical artifacts. We analyzed the correlation of the proportion of significant AE sites (binomial test, nominal p < 0.05) with various technical covariates in the Geuvadis data (Fig. 5a). In raw AE count data, we observe a high correlation with the library depth (unique reads; R2 = 0.24) — expectedly, since total read count of AE sites determines the statistical power to see significant effects (see below). In AE data corrected for variation in read counts by scaling the counts to 30, all technical correlations are very small and mostly non-significant, in stark contrast to gene expression level data that display strong batch effects (Fig. 5b). Thus, when appropriate measures are taken, AE analysis is an extremely robust approach that suffers less from technical factors than gene expression studies.
Statistical tests for AE
A binomial test is the classic way to determine whether the ratio of the two alleles is significantly different from the expected 0.5, and has been widely used [2, 5, 8, 31]. However, AE data are overdispersed compared with what is expected under a binomial distribution, likely as a result of both biological and technical factors [35, 41, 42]. These technical factors arise from systematic artifacts such as allelic mapping bias, as well as from imperfect reproducibility (measurement error), which we were able to estimate using eight technical replicates of five Geuvadis samples [40]. Accounting for duplicates and overlapping read mates reduced measurement error between replicates (Additional file 9), with very low level of residual variation between replicates except for the highly covered sites (>500), although we note that this may not apply to all data sets. The other QC measures described above remove systematic artifacts and reduce the inflation of binomial p values further (Fig. 6a). Nonetheless, the binomial p values remain inflated, and especially highly covered sites are likely to have remaining systematic artifacts (Fig. 6b). This suggests that a simple binomial test may not be an appropriate statistical test for allelic imbalance because it could result in a high number of false positives. However, given that most genes have eQTLs [4, 5, 8], biological sources of AE are expected to be extremely widespread, which is further supported by high heritability of AE [2]. Thus, while various statistical models have been put forward, many of which use variations of a beta-binomial model to infer the level of overdispersion [35, 41, 42], it remains inherently difficult to distinguish biological sources of overdispersion from putative technical effects. One approach is to analyze AE across individuals and tissues to control for confounders and capture the biological signal of interest — such as cis-regulatory variation [35, 41], imprinting [13], or nonsense-mediated decay [20]. However, many of the statistical approaches to analyze AE data are just emerging, and their full benchmarking is beyond the scope of this paper. For reference, a list of the currently available tools and publications that analyze AE data, including their specific biological application, statistical test used, and required inputs, can be found in Additional file 10.
Often during AE analysis the intent is to compare allelic imbalance between different sites, or between individuals. This is complicated by the highly variable total read counts at het-SNPs (Fig. 2a), since they lead to substantial differences in statistical power at different sites. These differences are driven by differences in library depth between samples, as well as biologically variable expression levels between genes and samples. Such differences can cause samples to cluster by experimental batch (Fig. 6c). If the goal of the analysis is to capture AE, patterns introduced by expression levels are often not desirable. While this problem ultimately needs to be addressed with tailored statistical approaches, it can be alleviated with a straightforward minimum effect size cutoff that reduces the enrichment of significant sites in highly covered het-SNPs (Fig. 6b), and accounts for the strongest dependency of total read counts (Fig. 6d). An experimental approach is to use an assay that yields high read counts, such as mmPCR-seq, instead of or alongside RNA-seq data [9, 12, 13, 43].
QC measures improve the power to detect biologically relevant AE
Regardless of the specific application, the QC measures proposed here should increase true signals of AE, resulting in improved power to detect biological phenomena of interest. To demonstrate this, we analyzed AE at 1154 genes with known eQTL (eGenes) in 343 European individuals using Geuvadis LCL RNA-seq data [5]. Individuals who are heterozygous for an eQTL SNP (eSNP) are expected to show increased AE within the eGene compared with those who are homozygous. Applying QC measures increased the significance of the difference in AE and reduced the variance of AE at eGenes (Additional file 11). Altogether this increased the power to distinguish between AE levels in eSNP heterozygous versus homozygous eGenes, with a 6.8 % increase in true positives, and 59.3 % decrease in false positives after applying QC measures (Fig. 7a, b). The measures also significantly increased the difference in the proportion of individuals exhibiting allelic imbalance (AE > 0.25) between the two classes (Fig. 7c), and resulted in a robust enrichment of sites within heterozygous eQTL across the spectrum of allelic imbalance (Fig. 7d). These results clearly illustrate the immediate benefit of ensuring AE data used for analysis are of high quality by applying the QC measures outlined here.