Gene-level differential analysis at transcript-level resolution

Gene-level differential expression analysis of RNA-sequencing data is more robust and experimentally actionable than transcript-level differential analysis. However, the use of gene counts for statistical analysis can mask transcript-level dynamics. We demonstrate that ‘analysis first, aggregation second,’ where the p-values derived from transcript analysis are aggregated to obtain gene-level results, increases sensitivity and accuracy. The method we propose can also be applied to transcript compatibility counts obtained from pseudoalignment of reads, which circumvents the need for quantification and is fast, accurate, and model-free. The method generalizes to various levels of biology, and we showcase an application to gene ontologies. Electronic supplementary material: The online version of this article (10.1186/s13059-018-1419-z) contains supplementary material, which is available to authorized users.


Converting transcript counts to gene abundances
Consider a gene with two transcripts $t_1$ and $t_2$ of lengths $l_1$ and $l_2$, and a two-condition experiment in which $X_{i1}$ reads originate from transcript $t_1$ and $X_{i2}$ reads originate from transcript $t_2$ in condition $i$ ($i \in \{1, 2\}$). The currently recommended approach to gene-level differential analysis is to compute a gene abundance for each condition as

$$Y_i = \left( \frac{X_{i1}}{l_1} + \frac{X_{i2}}{l_2} \right) \cdot \frac{l_1 + l_2}{2}.$$

The sum $\frac{X_{i1}}{l_1} + \frac{X_{i2}}{l_2}$ computes gene abundance as the sum of transcript abundances, and the factor $\frac{l_1 + l_2}{2}$ produces a "gene count" to be used in differential expression methods such as DESeq2 (Love et al., 2014) that perform shrinkage. The formula may contain extra constants related to sequencing depth. Note that

$$\frac{Y_1}{Y_2} = \frac{X_{11} l_2 + X_{12} l_1}{X_{21} l_2 + X_{22} l_1}.$$

This estimate of log-fold change avoids the problems discussed in (Trapnell et al., 2013). However, in terms of variance estimation, the "gene count" $Y_i \propto X_{i1} l_2 + X_{i2} l_1$ can be problematic when $l_1 \neq l_2$. Specifically, if (without loss of generality) $l_1 < l_2$, then the variance in $X_{i1}$ is amplified by $l_2$, an undesirable property since there is no reason why the variance contribution of one transcript should depend on the length of another.
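The equivalence between the "gene count" ratio $Y_1/Y_2$ and the length-weighted count ratio can be checked numerically. The counts and lengths below are made up for illustration, not taken from any dataset:

```python
import numpy as np

# Hypothetical counts for one gene with two transcripts in two conditions.
# X[i][j] = reads from transcript j+1 in condition i+1; l = transcript lengths.
X = np.array([[100.0, 40.0],   # condition 1
              [ 50.0, 80.0]])  # condition 2
l = np.array([1500.0, 3000.0])

# "Gene count" Y_i = (X_i1/l_1 + X_i2/l_2) * (l_1 + l_2)/2
Y = (X / l).sum(axis=1) * l.mean()

# Y_i is proportional to X_i1*l_2 + X_i2*l_1 (common factor (l_1+l_2)/(2*l_1*l_2)),
# so the fold change between conditions is identical under either form.
Z = X[:, 0] * l[1] + X[:, 1] * l[0]
assert np.isclose(Y[0] / Y[1], Z[0] / Z[1])
```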

Šidák aggregation
Let $T$ be the set of transcripts and $G$ the set of genes. For each $g \in G$, denote the number of transcripts in $g$ by $|g| = |\{t : t \in g\}|$. Let $P_t$ be the uniform random variable denoting the p-value output by a transcript-level differential analysis method under the null hypothesis of no differential abundance of transcript $t$. For a specific experiment, let $p_{tg}$ denote the p-value obtained from testing for differential abundance of transcript $t$ in gene $g$ by that method. Let $m_g = \min_{t \in g} p_{tg}$ and set $u_g = 1 - (1 - m_g)^{|g|}$.
For each $g \in G$, let $H_g$ be the null hypothesis that no transcript $t \in g$ is differentially abundant. Assume that the hypotheses $H_g$ are independent, and additionally that for each $g$ the hypotheses resulting in the values $p_{tg}$ are independent (i.e., the random variables $\{P_t\}_{t \in g}$ are independent).

Claim 1. The Benjamini-Hochberg step-up procedure applied to the values $\{u_g\}_{g \in G}$ controls the FDR for the $|G|$ null hypotheses $\{H_g\}_{g \in G}$.
Proof: Under the null hypothesis $H_g$, the probability that the smallest of the $\{P_t\}_{t \in g}$ is less than or equal to $m_g$ is

$$\Pr\left(\min_{t \in g} P_t \le m_g\right) = 1 - (1 - m_g)^{|g|} = u_g.$$

Therefore, under the null hypotheses the values $u_g$ are uniformly distributed. Given a significance threshold $\alpha$, the Benjamini-Hochberg step-up procedure identifies the largest $k$ such that, if the $u_g$ are ordered from smallest to largest as $u_{(1)} \le u_{(2)} \le u_{(3)} \le \cdots$, then

$$u_{(k)} \le \frac{k}{|G|}\alpha.$$

The procedure then rejects the null hypotheses corresponding to $u_{(1)}, \ldots, u_{(k)}$. The assumption that the hypotheses $H_g$ are independent guarantees that the Benjamini-Hochberg procedure has a false discovery rate bounded by $\alpha$. Note that for small values of $m_g$, $1 - (1 - m_g)^{|g|} \approx |g| m_g$ by Taylor approximation. The use of $|g| m_g$ instead of $u_g$ would be equivalent to a Bonferroni correction at the gene level prior to the Benjamini-Hochberg procedure. As described, the $u_g$ can be viewed as a Šidák correction (Šidák, 1967).
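The procedure in the claim can be sketched in a few lines. This is a minimal illustration with made-up gene names and p-values, not the sleuth implementation:

```python
import numpy as np

# Hypothetical transcript p-values grouped by gene.
p_by_gene = {
    "geneA": [0.001, 0.20, 0.90],
    "geneB": [0.04],
    "geneC": [0.30, 0.55],
}

# Šidák-corrected gene p-value: u_g = 1 - (1 - min_t p_tg)^{|g|}
u = {g: 1.0 - (1.0 - min(ps)) ** len(ps) for g, ps in p_by_gene.items()}

def benjamini_hochberg(pvals):
    """Return BH-adjusted q-values in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest rank downward.
    q = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(q, 0.0, 1.0)
    return out

genes = sorted(u)
qvals = benjamini_hochberg([u[g] for g in genes])
# Reject H_g for every gene with q-value below the chosen threshold alpha.
```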
The DEXSeq program (Anders et al., 2012) implements a similar, but slightly different, approach to aggregation. Instead of applying the Šidák correction to the minimum p-value from each gene and sorting genes according to the $u_g$ values, DEXSeq sorts genes directly according to the $m_g$ values. While the DEXSeq approach to controlling the false discovery rate is correct (Reyes et al., 2012), the ordering according to $m_g$ is not equivalent to the ordering according to $u_g$ and has a drawback. Specifically, genes with many isoforms are more likely, by chance, to produce a small p-value in one of their isoforms, and this has two consequences. First, the most highly ranked genes (i.e., those with the smallest values of $m_g$) will tend to have more isoforms. Second, while DEXSeq controls the FDR, false positives are more likely among genes with small $m_g$ values than among genes with small $u_g$ values, so the "FDR budget" is consumed more quickly.
When simulating p-values according to a null distribution (i.e., each transcript in the genome receives a p-value that is uniformly distributed) for the Mus musculus transcriptome (GRCm38 release 88), we found that the average number of isoforms among the top 100 ranked genes according to $m_g$ is 6.89 (median = 5) versus 2.77 for $u_g$ (median = 1). For context, the average number of transcripts per gene in the Mus musculus transcriptome is 3.1, with a median of 1.
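A null simulation in the same spirit illustrates why ranking by $m_g$ favors isoform-rich genes while ranking by $u_g$ does not. The isoform counts here are hypothetical, not the mouse annotation used in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transcriptome: each gene gets 1-20 isoforms, and every
# transcript receives an independent Uniform(0, 1) p-value under the null.
n_genes = 5000
iso = rng.integers(1, 21, size=n_genes)                  # isoforms per gene
m = np.array([rng.uniform(size=k).min() for k in iso])   # m_g = min p-value
u = 1.0 - (1.0 - m) ** iso                               # Šidák correction

top_m = np.argsort(m)[:100]   # rank genes by raw minimum p-value
top_u = np.argsort(u)[:100]   # rank genes by Šidák-corrected value

# Ranking by m_g favors isoform-rich genes (a min over more uniforms is
# smaller); u_g is Uniform(0, 1) regardless of |g|, so the bias vanishes.
assert iso[top_m].mean() > iso[top_u].mean()
```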

Applicability of the Lancaster method to RNA-Seq
Transcripts of the same gene are not in general independent of each other, since they may be biologically co-regulated and are quantified from the same pool of reads. We therefore performed a series of experiments demonstrating that the extent of transcript dependence is limited and that the assumption of independence in applying the Lancaster method is a reasonable approximation. Furthermore, we show that applying the Lancaster method to aggregate transcript p-values does not inflate the false discovery rate (FDR) or false positive rate (FPR).
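For reference, the Lancaster aggregation itself can be sketched as follows: each p-value is transformed to a chi-square quantile with $2w$ degrees of freedom for its weight $w$, the quantiles are summed, and the sum is compared to a chi-square distribution with $2\sum w$ degrees of freedom. The weights and p-values below are illustrative (assuming, e.g., mean transcript expression as the weight), not from the analyses in the text:

```python
import numpy as np
from scipy.stats import chi2

def lancaster(pvals, weights):
    """Lancaster p-value aggregation: map each p-value to a chi-square
    quantile with 2*w degrees of freedom, sum the quantiles, and compare
    the sum to a chi-square with 2*sum(w) degrees of freedom."""
    pvals = np.asarray(pvals, dtype=float)
    weights = np.asarray(weights, dtype=float)
    stat = chi2.isf(pvals, df=2.0 * weights).sum()
    return chi2.sf(stat, df=2.0 * weights.sum())

# Hypothetical transcript p-values weighted by (e.g.) mean expression.
p_gene = lancaster([0.01, 0.8, 0.6], weights=[100.0, 5.0, 2.0])
```

With all weights equal to 1, this reduces to Fisher's method, since $\chi^2_2$ quantiles of $p$ equal $-2\ln p$.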

Independence of transcript p-values
We performed an in silico experiment to measure the distribution of p-values under the null hypothesis. To construct an instance of the null hypothesis, we chose six samples at random from the GEUVADIS samples from Finnish women and performed a 3v3 differential expression analysis on transcripts using sleuth. A pair of transcript p-values was randomly selected from each gene and scatter-plotted (Supplementary Figure 8), showing that transcript p-values appear linearly uncorrelated (Pearson correlation of 0.064) and, apart from a region of high density of low-low p-value pairs, appear uniform and independent on first-order visualization.
To test directly for independence, a chi-square test was performed on a contingency table constructed from these p-value pairs by binning p-values uniformly in 0.1 intervals. The chi-square test yielded a p-value of 7.5e−8, supporting the expectation that p-values of isoforms within a gene are not independent. However, when p-values within the (0, 0.1) interval were excluded, the chi-square test for independence yielded a p-value of 0.74, consistent with what we observed upon visualization and demonstrating that transcript p-value dependence is largely confined to the small p-value regime, with p-values otherwise behaving independently.
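The binning-and-testing step can be sketched as follows. The pairs here are simulated as independent uniforms rather than taken from the GEUVADIS analysis, so (unlike the real isoform pairs) the test should not reject:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)

# Simulated stand-in for the p-value pairs: independent Uniform(0, 1).
p1 = rng.uniform(size=20000)
p2 = rng.uniform(size=20000)

# Bin pairs into a 10x10 contingency table (0.1-wide bins).
bins = np.linspace(0.0, 1.0, 11)
table, _, _ = np.histogram2d(p1, p2, bins=[bins, bins])

# Chi-square test of independence on the binned table.
stat, pval, dof, _ = chi2_contingency(table)
assert dof == 81  # (10 - 1) * (10 - 1)
```

Excluding the (0, 0.1) interval, as in the text, amounts to dropping the first row and column of the table before testing.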

Control of false positive rates and false discovery rates
One reasonable critique of applying the Lancaster method to RNA-Seq is that even weak dependence between transcripts under the null hypothesis leads to exaggerated gene p-values and an inflated false positive rate (FPR) and false discovery rate (FDR), especially when small transcript p-values are aggregated. To address this critique, we performed an experiment to test the effect of transcript dependence on the FPR and FDR under the null hypothesis. In each trial, we randomly chose six samples from the same batch within the Finnish women GEUVADIS samples and compared the numbers of false positives (p-value < 0.05) and false discoveries (q-value < 0.05) found in gene mode to the numbers found with the Lancaster method using sleuth. We performed 20 such trials and tabulated the results in Supplementary Figure 9. In all 20 trials, the number of false positives was reduced with Lancaster aggregation (paired t-test, p-value = 6.5e−6), and in 14 trials the number of false discoveries was reduced (paired t-test, p-value = 0.141). These null simulation results demonstrate that the Lancaster method reduced, rather than inflated, the FPR and FDR, and they are consistent with the findings in the perturbation simulation results (Figure 3, Supplementary Figures 1-3) that the Lancaster method reported more accurate and conservative FDRs than other methods.
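The per-trial paired comparison can be sketched as follows, with simulated counts standing in for the GEUVADIS results:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(2)

# Simulated stand-in: each of 20 trials yields a false-positive count
# under gene mode and under Lancaster aggregation. The Lancaster counts
# are constructed to be systematically lower, mimicking a consistent
# per-trial reduction (values are illustrative, not the real results).
gene_mode = rng.poisson(200, size=20).astype(float)
lancaster_mode = gene_mode - rng.poisson(30, size=20)

# A paired t-test asks whether the per-trial reduction is consistent.
stat, pval = ttest_rel(gene_mode, lancaster_mode)
assert pval < 0.05  # the consistent reduction is detected
```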

Additional Supplementary Figures
Supplementary Figure 4: Log-log scatterplot of the p-values from a standard gene-level analysis in sleuth and the p-values from Lancaster aggregation of sleuth-derived TCC p-values. The dexamethasone dataset was used, with p-values corresponding to the significance that the gene was differentially expressed in dexamethasone vs. vehicle treatment. 2772 differential genes (FDR < 0.05) were discovered by standard gene analysis, compared to the 4336 significant genes (FDR < 0.05) discovered by Lancaster aggregation on TCCs.

Supplementary Figure 9: Comparisons of the number of false positives and discoveries obtained by sleuth gene mode versus Lancaster aggregation on sleuth-derived transcript p-values. Each of the twenty null hypothesis trials is simulated by performing 3 vs. 3 differential expression analyses in sleuth of batch-corrected samples from Finnish women in the GEUVADIS dataset.

Enrichment of 'Immune' GO Terms in GO Analysis
Supplementary Figure 10a: Performance of the likelihood ratio test on the experimental effect simulation. Instead of the Wald test, sleuth was invoked using the likelihood ratio test in gene mode ('sleuth -Gene') and on transcripts. Transcript p-values from the likelihood ratio test were aggregated with the Šidák correction ('sleuth -Sidak') or the Lancaster method ('sleuth -Lancaster Tx'). For the sake of comparison, 'sleuth -Lancaster TCC,' which is always performed with the likelihood ratio test, is displayed.
Supplementary Figure 10b: Performance of the likelihood ratio test on the experimental effect simulation (zoomed-in).