Differential gene expression analysis tools exhibit substandard performance for long non-coding RNA-sequencing data

Assefa, Alemu Takele; De Paepe, Katrijn; Everaert, Celine; Mestdagh, Pieter; Thas, Olivier; Vandesompele, Jo

doi:10.1186/s13059-018-1466-5

Research
Open access
Published: 24 July 2018

Genome Biology volume 19, Article number: 96 (2018)

Abstract

Background

Long non-coding RNAs (lncRNAs) are typically expressed at low levels and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 25 pipelines for testing DE in RNA-seq data is comprehensively evaluated, with a particular focus on lncRNAs and low-abundance mRNAs. Fifteen performance metrics are used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets.

Results

Gene expression data are simulated using non-parametric procedures in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, results for mRNA and lncRNA were tracked separately. All the pipelines exhibit inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and benchmark RNA-seq datasets. The substandard performance of DE tools for lncRNAs applies also to low-abundance mRNAs. No single tool uniformly outperformed the others. Variability, number of samples, and fraction of DE genes markedly influenced DE tool performance.

Conclusions

Overall, linear modeling with empirical Bayes moderation (limma) and a non-parametric approach (SAMSeq) showed good control of the false discovery rate and reasonable sensitivity. Of note, for achieving a sensitivity of at least 50%, more than 80 samples are required when studying expression levels in realistic settings such as in clinical cancer research. About half of the methods showed a substantial excess of false discoveries, making these methods unreliable for DE analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, giving guidance on selection of the optimal DE tool (http://statapps.ugent.be/tools/AppDGE/).

Background

Messenger RNA (mRNA) has been the primary target of transcriptome studies. However, RNA sequencing technology has revealed that the human genome is pervasively transcribed, resulting in thousands of novel non-coding RNA genes. Hence, attention is expanding to one of the most poorly understood, yet most common RNA species: long non-coding RNAs (lncRNAs) [1, 2]. These lncRNAs form a large and diverse class of transcribed RNA molecules, defined by a length of more than 200 nucleotides and constituting up to 70% of the transcriptome. While they do not encode proteins, lncRNAs are strong regulators of gene expression [3]. The discovery and study of lncRNAs are of major relevance to human health and disease because they represent an extensive, largely unexplored, and functional component of the genome [3,4,5]. In contrast to mRNAs, lncRNAs are generally expressed in low amounts, typically an order of magnitude lower than mRNA expression levels [2, 6, 7]. Furthermore, several studies [7,8,9] demonstrated that lncRNA expression levels are very noisy, which is a characteristic shared with low count data from massively parallel RNA sequencing.

Following the advent of RNA-sequencing (RNA-seq) technologies, several statistical tools for differential gene expression (DGE) analysis have been introduced. However, low and noisy read counts, such as those coming from lncRNAs, are potentially challenging for the tools [10, 11]. For example, it is commonly observed that low count genes show large variability of the fold-change estimates and thus exhibit inherently noisier inferential behavior. The majority of the methods suggest removal of low expressed genes before the start of data analysis, but this procedure essentially blocks researchers from studying lncRNAs. In our study, no such severe filtering was applied, leaving almost all lncRNAs in the dataset. To our knowledge, no statistical method has been specifically developed for the analysis of lncRNA-seq data and therefore transcriptome studies make use of statistical methods that assume sufficient expression levels. In this paper, we evaluated and compared the performance of many popular statistical methods (Table 1) developed for testing DGE of RNA-seq data (hereafter referred to as “DE tools”), with special emphasis on lncRNAs and low-abundance mRNAs. All tools considered in this study are popular (in terms of number of citations), available as R software packages [12], and use gene or transcript level read counts as input. Our conclusions are based on six RNA-seq datasets and many realistic simulations, representing various typical gene expression experiments.

Table 1 List of DE tools and pipelines along with their reference and number of citations

Previous comparative studies of DE tools [11, 13,14,15,16,17,18] focused on mRNA and some of these concluded that DE tools show inferior performance for genes or transcripts with low counts. We extend previous studies by including lncRNAs and lowly expressed mRNAs separately. Further, our results are based on diverse types of RNA-seq datasets that vary with respect to their biological and technical features, such as species (human, mouse, rat), experimental design (control versus treatment, diseased versus non-diseased, and tissue comparisons), and level of biological variability. We assess the degree of concordance among results returned from the DE tools, and we study important statistical properties of DE tools, such as their ability to control the false discovery rate (FDR) and their sensitivity for the detection of DE. The latter are empirically investigated using a non-parametric resampling-based simulation procedure. The simulation method essentially resamples data from a real RNA-seq dataset to create realistic gene expression scenarios. Consequently, our results reflect the genuine behaviour of the DE tools under study, in contrast to simulation studies based on parametric assumptions. Note that evaluating a method that relies on a parametric assumption (e.g., edgeR, DESeq, and DESeq2 assume a negative binomial distribution) on counts simulated from that same distribution, as was done in [14], will give overly optimistic results. Moreover, these results do not reflect a realistic setting because the distributional assumption cannot be expected to hold in general [19]. By starting from a variety of real and representative RNA-seq datasets, the scope of our findings is wide. To our knowledge, our study is the largest empirical evaluation conducted so far in terms of the number of real datasets used, the number of performance metrics evaluated, and the number of DE pipelines included (Additional file 1: Figure S1).

Our study consists of four parts: first we evaluated various normalization procedures; second we compared the level of agreement among DE pipelines using various publicly available RNA-seq datasets; third we explored the ability of the DE pipelines to recover known evidence of differential expression; and fourth we used simulation procedures to evaluate and compare the performance of the tools under a variety of gene expression experiment scenarios (Fig. 1), such as variability, sample size, and fraction of DE genes.

Results and discussion

RNA-seq datasets

Six publicly available benchmark RNA-seq datasets were used for the concordance analysis. Three of them were used as source datasets for generating non-parametric simulated data. The description of the datasets can be found in the “Methods” section; a summary is presented in Table 2.

Table 2 Summary of datasets

The degree of homogeneity among samples, as measured by Pearson’s correlation coefficient, was lowest for the Zhang dataset followed by GTEx (see also the estimated biological coefficients of variation in Additional file 1: Figure S2). As expected, the other datasets had more homogeneous replicates because they were obtained from inbred animals or cultured cell lines, in contrast to the GTEx and Zhang datasets, which contain tissues from different human individuals. For the Zhang and NGP nutlin datasets, lncRNAs showed relatively higher heterogeneity across samples than mRNAs. In addition, lncRNAs showed, on average, lower expression than mRNAs (Additional file 1: Figure S3).

Comparison of normalization methods

Comparing DE tools requires careful attention to the normalization methods. Previous studies [13, 16, 20, 21] have pointed out that the normalization procedure can affect DE results. The aim of our study is not to perform a comprehensive comparison of all normalization methods. Instead, we compared five normalization methods that are used in conjunction with the DE methods evaluated in this study. This will allow us to better understand the general behavior of the DE tools as evaluated in the subsequent sections. The normalization methods were compared using the metrics from Dillies et al. [20], such as their capability to reduce technical variability and to eliminate bias due to library size differences, and their effect on DGE analysis.

Box plots of the relative log expressions show that for all six datasets all normalization methods succeed in aligning the sample-specific distributions and hence no library size effects were noticeable after normalization (Additional file 2: Section 2.2). Furthermore, the condition-specific gene-wise coefficient of variation (CV), which is a proxy for intra-group variability, was lower for all datasets upon normalization (Fig. 2b and Additional file 2: Section 2.3). Nearly equal levels of biological variability across methods were observed, even with quantile normalization, which was found to result in high CV in other studies [20, 22]. The overlap of DE genes with different normalization methods was high (Fig. 2a and Additional file 2: Section 2.4). Ignoring quantile normalization (QN), on average (across the six datasets) a minimum of 86% similarity was observed. QN-based DE analysis gives deviating results, particularly for designs with small numbers of replicates (< 5); the average minimum proportion of similarity was 70.1% (average minimums are calculated across datasets). Overall, the results suggest that all normalization methods perform almost equally, except QN. Nevertheless, for the concordance analysis of the DE tools (see next section) we include a limma pipeline that uses QN (named limmaQN) to further investigate its effect on other performance metrics of DE tools.
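
As an illustration of the condition-specific gene-wise CV mentioned above, the following minimal sketch computes, for each gene, the ratio of the standard deviation to the mean of normalized counts within each condition. The function name genewise_cv, the input layout, and the use of the sample standard deviation are assumptions made for this example; they are not taken from the study's own code.

```python
from statistics import mean, stdev


def genewise_cv(counts, groups):
    """Compute the coefficient of variation per gene within each condition.

    counts: list of genes, each a list of normalized counts (one per sample).
    groups: list of condition labels, one per sample (same order as counts).
    Returns {condition: [CV of gene 0, CV of gene 1, ...]}.
    Assumes at least two samples per condition.
    """
    result = {}
    for cond in sorted(set(groups)):
        # Column indices of the samples that belong to this condition
        idx = [i for i, g in enumerate(groups) if g == cond]
        cvs = []
        for gene in counts:
            values = [gene[i] for i in idx]
            m = mean(values)
            # The CV is undefined for a zero mean; report None in that case
            cvs.append(stdev(values) / m if m > 0 else None)
        result[cond] = cvs
    return result
```

Comparing the distribution of these values before and after normalization gives a rough view of how much intra-group variability a normalization method removes.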

Concordance analysis

Twenty-five DE pipelines were run on six RNA-seq datasets, and (dis)similarities among the results were examined. The concordance analysis focused on five quantitative and one qualitative metric: (1) number of genes identified as significantly differentially expressed (SDE); (2) similarity in terms of the set of SDE genes; (3) the degree of agreement on gene ranking; (4) similarity of fold-change estimates; (5) handling of genes with special characteristics (lncRNAs, genes with low counts, genes with outliers); and (6) computation time. The results for individual datasets are presented in Additional file 3.

The pipelines show substantial variability in the number of SDE genes. The marginal summary across all datasets indicates that DESeq, NOISeq, baySeq, and limmaQN detected the smallest number of SDE genes, whereas QuasiSeq and SAMSeq returned the largest numbers (Fig. 3). The variability among DE pipelines with respect to the number of SDE genes seems to be related to the biological variability in the dataset. For the Zhang and GTEx RNA-seq datasets, characterized by the largest intra-group biological variability, the numbers of SDE genes were quite different among the DE pipelines. In contrast, the numbers of SDE genes from the NGP nutlin and CRC AZA datasets, which display low biological variability, were relatively similar among pipelines. lncRNAs and low-abundance genes in general were under-represented among the SDE genes (Additional file 3). For example, 25% of the SDE genes were lncRNAs, whereas the data contain 40% lncRNAs.

Many of the DE pipelines agreed with each other in terms of the set of SDE genes (Fig. 3). On average, NOISeq, limmaQN, DESeq, baySeq, and SAMSeq showed the smallest concordance with all other tested pipelines. It was also observed that the overlap of SDE genes is smaller for lncRNAs than for mRNAs (Additional file 1: Figure S4). In the Zhang dataset, the SDE overlap across all DE tools is less than 70% for mRNAs and less than 60% for lncRNAs.
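
The overlap between sets of SDE genes can be quantified in several ways, and the text above does not state which measure was used. As a hedged illustration only, the sketch below uses the Jaccard index (intersection size divided by union size) for every pair of pipelines. The function names and the choice of the Jaccard index are assumptions for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene sets: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical
    return len(a & b) / len(union) if union else 1.0


def pairwise_overlap(sde_sets):
    """Jaccard index for every pair of pipelines.

    sde_sets: {pipeline name: iterable of gene identifiers called SDE}.
    Returns {(pipeline_1, pipeline_2): overlap}, with names sorted alphabetically.
    """
    return {
        (p, q): jaccard(sde_sets[p], sde_sets[q])
        for p, q in combinations(sorted(sde_sets), 2)
    }
```

Running the same computation separately on the lncRNA and mRNA subsets of each SDE list would give a biotype-specific comparison of the kind described above.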

Accurate gene ranking is an essential step for downstream analysis such as gene set enrichment analysis (GSEA) [23]. The degree of agreement among the gene rankings of the 25 DE pipelines was studied using the rank of π scores, which take into account both the significance and the magnitude of differential expression [24]. Summarized results across datasets (Fig. 3) indicate that all pipelines strongly agree, except for baySeq, NOISeq, SAMSeq, and limmaQN. Apart from baySeq, this is somewhat in contrast to the findings in Soneson and Delorenzi [14]. This might be due to the difference in the score used to rank genes, as only p values were used to rank genes in Soneson and Delorenzi [14]. Except for limmaQN, gene ranking agreement among all pipelines was nearly the same for lncRNAs and mRNAs in the analysis of the NGP nutlin data. A slightly lower agreement for lncRNAs was observed when the most variable dataset (Zhang) was used (Additional file 1: Figure S4).
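
The text above does not give the formula for the π score. The sketch below assumes a commonly used definition, the absolute log fold change multiplied by the negative base-10 logarithm of the p value, and ranks genes by that score. Both the formula and the function names are assumptions for illustration and should be checked against reference [24].

```python
import math


def pi_scores(log_fold_changes, p_values, min_p=1e-300):
    """Combine effect size and significance into one score per gene.

    Assumed definition: pi = |LFC| * (-log10(p)).
    min_p guards against taking the logarithm of zero.
    """
    return [
        abs(lfc) * -math.log10(max(p, min_p))
        for lfc, p in zip(log_fold_changes, p_values)
    ]


def rank_descending(scores):
    """Rank 1 goes to the largest score; ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks
```

Under this definition a gene with a large fold change but a weak p value, or a tiny fold change with a very small p value, is ranked below a gene that is strong on both, which is one reason rankings based on π scores can differ from rankings based on p values alone.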

Moreover, the log fold-change (LFC) estimates from all DE tools were strongly correlated, with a minimum of 0.8 Pearson correlation coefficient (on average) for limmaVst, limmaQN, and limmaTrended pipelines (Fig. 3 and Additional file 3). However, the correlations become relatively stronger for the datasets with higher numbers of samples per group. In addition, the correlations for lncRNAs were lower than for mRNAs (Additional file 1: Figure S4 and Additional file 3: Sections 5.4 and 6.4).

In addition, we qualitatively examined the handling of genes with outlier expression (Additional file 1: Section 3.1). A set of genes with an outlier count in only one of the samples (from the Zhang data) was chosen (Additional file 1: Figure S5). The adjusted p values for these outlier genes show that edgeR exact, edgeR GLM, edgeR QL, PoissonSeq, QuasiSeq (both settings), and baySeq declared most of them SDE at 5% nominal FDR (Additional file 1: Table S2), suggesting that they can be affected by outlier expression.

To come to an overall conclusion, the results were combined in a hierarchical clustering analysis of the DE pipelines, resulting in 4 clusters (Fig. 3). DESeq, baySeq, limmaQN, and NOISeq cluster together, generally showing the lowest number of SDE genes, lower overlap, and lower gene ranking agreement with all other DE pipelines. The second cluster includes edgeR exact, edgeR GLM, edgeR QL, DESeq2 (both settings), and limmaVoom (robust and not robust), showing the highest concordance with respect to calling SDE, gene ranking, and LFC estimates. Pipelines in this cluster generally identify more SDE genes than methods in the first cluster. LimmaTrended (robust and not robust) and limmaVst appear in a separate cluster because of their relatively weakly correlated LFC estimates with that of other pipelines, but these pipelines strongly resemble the second cluster with respect to the other concordance metrics. The last cluster includes QuasiSeq (both settings), edgeR robust (with both tested prior degrees of freedom), limmaVoom+QW, PoissonSeq, and SAMSeq. They detect the most SDE genes and show a modest proportion of overlap, gene ranking agreement, and LFC similarity.

Moreover, with respect to identifying DE genes among genes that are detected only in one group of samples, DESeq, baySeq, and PoissonSeq fail to estimate a meaningful fold change. On the other hand, edgeR exact test, DESeq, and SAMSeq return no p value for such genes with a low signal-to-noise (STN) ratio (Additional file 1: Section 3.2). STN is defined as the ratio of the mean to the standard deviation of normalized counts in the group with detected gene expression [13]. In general, and not unexpectedly, all the pipelines assign significant p values for such genes with a high STN ratio (Additional file 1: Figure S6). This suggests that researchers need to be cautious when interpreting the DE results, particularly when the zero read counts in one of the groups are likely caused by technical artefacts. Moreover, for lncRNAs (and also for low-abundance mRNAs), the STN ratio is typically low, and hence all the DE pipelines fail to detect true DE among such genes. However, from the relationship between the STN and adjusted p values, one can learn that limma and QuasiSeq tools (and edgeR robust and DESeq2 to a lesser extent) detect such genes as SDE even at low STN (Additional file 1: Figure S6).
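
The STN definition above translates directly into code. The following minimal sketch computes it for a single gene from the normalized counts of the group in which the gene is detected. The function name and the use of the sample standard deviation are assumptions for this example.

```python
from statistics import mean, stdev


def signal_to_noise(normalized_counts):
    """STN ratio: mean divided by standard deviation of normalized counts.

    normalized_counts: counts of one gene in the group where it is detected
    (at least two samples, mean assumed to be positive).
    Returns infinity when all counts are identical (zero standard deviation).
    """
    m = mean(normalized_counts)
    s = stdev(normalized_counts)
    return m / s if s > 0 else float("inf")
```

A gene with counts of 8, 10, and 12 in the expressing group has a mean of 10 and a sample standard deviation of 2, giving an STN of 5; a lowly expressed gene with counts of 0, 1, and 5 has a much lower ratio.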

Results obtained with the three settings of DESeq2 were not markedly different, except that the independent filtering excluded more lncRNAs (29% from the Zhang data) than mRNAs (Additional file 1: Figure S7). Among the seven limma pipelines, voom and trended (with and without robust estimate of the prior degrees of freedom) showed relatively better concordance. In addition, voom with sample quality weights (limmaVoom+QW) tends to identify more SDE genes. Similarly, edgeR pipelines attained similar concordance except that edgeR robust detects slightly more SDE genes than the average. Although the three QuasiSeq pipelines cluster together, the quasi-likelihood (QL) method with an independent estimate of the gene-wise QL dispersion showed worse agreement in terms of the set of SDE genes.

The computation time to run DGE analysis presented in Additional file 1: Figure S8 shows that baySeq and DESeq require the longest time, whereas limma tools and PoissonSeq run fast. For RNA-seq data with ten replicates per group and 19,150 mRNAs, the slowest tools, baySeq and DESeq, were approximately 8000 and 2000 times slower than the fastest pipeline, limmaQN, respectively.

Recovering biological truth

In addition to the concordance analysis, we also assessed the capability of the DE tools to recover genes with known biological evidence of DE in the benchmark datasets. To this purpose, results from three published studies were used to define the truth: genes with gender-biased expression [25], MYCN regulated genes [26], and TP53 pathway genes [27] (see “Methods” for description). The ability to recover the truth is evaluated using four metrics: number of recovered genes, similarity among DE pipelines in terms of the set of recovered genes, gene classification agreement with the truth, and GSEA. Detailed results can be found in Additional file 4.

Despite the challenge of defining biological truth, several pipelines show relatively good performance in recovering the known truth, especially considering that the experimental conditions are not identical in the benchmark studies and the truth studies. However, in terms of the number of recovered genes and the degree of similarity to each other, the pipelines show substantial variation. In line with the concordance analysis, conservative tools (DESeq, baySeq, and NOISeq) recovered a relatively lower number of genes with low similarity to other tools (Additional file 4: Figure S8). In contrast, tools such as SAMSeq and PoissonSeq that were categorized as liberal (highest number of SDE genes) according to the concordance analysis now ranked generally low in recovering the biological truth across the three control studies and exhibited the least agreement with other pipelines. Across the four metrics assessing biological truth, DESeq2 (both settings), edgeR (robust), and limma (voom+QW, voom, and trended) outperformed all other tools, whereas PoissonSeq, SAMSeq, NOISeq, DESeq, and QuasiSeq (both settings) showed inferior capability.

Simulation results

The non-parametric SimSeq [28] procedure was applied to realistically simulate RNA-seq expression data. The simulation technique involves sub-sampling of replicates from a real RNA-seq dataset with a sufficiently large number of replicates. In this way, the underlying characteristics of the source dataset are preserved, including the count distributions and variability. The representativeness of the simulated data was examined using various quality metrics, including those proposed by Soneson and Robinson [29] (see “Methods” section). Three series of simulations were performed, each starting from a different RNA-seq source dataset: Zhang, NGP nutlin, and GTEx data. The degree of homogeneity among the replicates in these datasets varies, reflecting different levels of intra-group biological variability (Table 2 and Additional file 1: Figure S2). The Zhang and NGP nutlin datasets include annotated lncRNAs along with mRNAs, whereas the GTEx RNA-seq dataset contains only annotated mRNA genes. Therefore, simulated counts for mRNA and lncRNA are sampled from mRNA and lncRNA counts of the source dataset, respectively.
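
To make the idea of resampling-based simulation concrete, the sketch below draws replicates from a two-group source dataset. It assumes that null genes take both simulated groups from the same source group, whereas DE genes take the second simulated group from the other source group. This is a simplified illustration in the spirit of such procedures, not the SimSeq implementation: it omits how DE genes are selected and any library size correction, and all names are hypothetical.

```python
import random


def resample_two_groups(counts, group1_cols, group2_cols, de_genes, n, seed=0):
    """Simulate a two-group dataset with n replicates per group by sub-sampling.

    counts: {gene: list of counts, one per source sample (column)}.
    group1_cols, group2_cols: column indices of the two source groups.
    de_genes: set of genes that should carry a true group difference.
    Returns {gene: n counts for simulated group A followed by n for group B}.
    """
    if len(group1_cols) < 2 * n or len(group2_cols) < n:
        raise ValueError("source dataset has too few replicates for this n")
    rng = random.Random(seed)
    # Two disjoint subsets from source group 1, one subset from source group 2
    picked = rng.sample(list(group1_cols), 2 * n)
    first, second = picked[:n], picked[n:]
    other = rng.sample(list(group2_cols), n)

    simulated = {}
    for gene, row in counts.items():
        group_a = [row[c] for c in first]
        # DE genes keep the real group difference; null genes do not
        source_b = other if gene in de_genes else second
        group_b = [row[c] for c in source_b]
        simulated[gene] = group_a + group_b
    return simulated
```

Because every simulated count is an observed count, the count distributions and the gene-level variability of the source data carry over without any parametric assumption, which is the property the paragraph above relies on.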

Gene expressions were simulated under a wide range of scenarios that may affect the performance of DE tools: different numbers of replicates ranging from 2 to 40, different proportions of true DE genes (0 to 30%), two gene biotypes (mRNA and lncRNA), and different levels of intra-group biological variability (as present in the three source datasets). From the simulation results, the actual FDR, true positive rate (TPR), and false positive rate (FPR) were computed for each DE pipeline. The comparison between the two gene biotypes was done in two ways: simulating lncRNA data only or simulating lncRNA and mRNA jointly, but analyzing separately.

False discovery rate and true positive rate

FDR refers to the average proportion of incorrect discoveries among SDE genes (genes identified as DE at a particular nominal FDR threshold). A good DE tool has an actual FDR close to the nominal level and a high TPR. The TPR, also known as sensitivity, is the average proportion of SDE genes among the true DE genes. The TPR should be sufficiently large, otherwise one cannot expect to find many of the true DE genes. Therefore, it is customary to look for a DE pipeline that has the highest TPR among those that control FDR (i.e., actual FDR is close to nominal FDR). The FDR versus TPR curve is used to compare the performance of DE pipelines at various nominal FDR thresholds (ranging from 0 to 100%).
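
The definitions above can be turned into a small calculation. For one simulated dataset, the sketch below computes the false discovery proportion and the true positive rate at a given nominal threshold; averaging these over many simulated datasets estimates the actual FDR and TPR. The function names, the strict inequality at the threshold, and the convention of a zero false discovery proportion when nothing is called are assumptions for this example.

```python
def fdp_and_tpr(adjusted_p, is_true_de, nominal_fdr=0.05):
    """False discovery proportion and true positive rate for one dataset.

    adjusted_p: adjusted p values, one per gene.
    is_true_de: booleans, True where the gene is truly DE in the simulation.
    """
    called = [p < nominal_fdr for p in adjusted_p]
    true_pos = sum(c and t for c, t in zip(called, is_true_de))
    false_pos = sum(c and not t for c, t in zip(called, is_true_de))
    n_called = true_pos + false_pos
    n_true = sum(is_true_de)
    # Convention: no discoveries means no false discoveries
    fdp = false_pos / n_called if n_called else 0.0
    tpr = true_pos / n_true if n_true else float("nan")
    return fdp, tpr


def average_performance(per_dataset_results):
    """Average (FDP, TPR) pairs over simulated datasets to estimate FDR and TPR."""
    fdps = [r[0] for r in per_dataset_results]
    tprs = [r[1] for r in per_dataset_results]
    return sum(fdps) / len(fdps), sum(tprs) / len(tprs)
```

Repeating the calculation over a grid of nominal thresholds yields the points of an FDR versus TPR curve for one pipeline.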

Results from the first simulation (starting from the Zhang data) generally indicate that the FDR is not controlled well by many DE pipelines (Fig. 4). Among the pipelines that control the FDR relatively well, many have a small TPR. Besides the gene biotype (mRNA versus lncRNA), the performance is correlated with the level of the intra-group variability, the number of replicate samples, and the fraction of DE genes. Many DE tools show severe FDR inflation and slightly lower TPR when only a small fraction of genes is DE (Additional file 1: Figures S9 and S10). The actual FDR may even exceed 50%, which means that more than half of the called SDE genes may be false discoveries. For most DE tools, better FDR control and higher sensitivity were attained with increasing number of replicates (Fig. 4 and Additional file 1: Figures S11 and S12). Performance of all DE pipelines is considerably poorer for lncRNAs than for mRNAs (Figs. 4 and 5). However, very similar results (poor performance in terms of FDR control and TPR) were obtained for low-abundance mRNAs based on a simulation starting from the GTEx data (Additional file 1: Figure S13).

For the simulation that started from the (homogeneous) NGP nutlin data, the results were better (Fig. 5), with good FDR control and high TPR for all DE tools, even for small numbers of replicates. Only for simulations with 5% of true DE genes was the FDR control lost (Additional file 1: Figure S10). The difference in performance between the Zhang and NGP nutlin simulations can be explained by their intra-group variability (Table 2 and Additional file 1: Figure S2): the NGP nutlin data come from cell line replicates that are characterized by low biological variability. For the simulations starting from the GTEx dataset, which has intermediate biological variability, the performance of the DE tools is somewhere in between those for the Zhang and NGP nutlin datasets (Additional file 1: Figure S14).

Because of the trade-off between FDR and TPR, a high TPR is expected for DE tools with a high actual FDR. This was observed for edgeR, DESeq2, and QuasiSeq pipelines, particularly for small numbers of replicates (Fig. 4). limma and SAMSeq showed better FDR control, while retaining a high TPR. Their better performance is true for both biotypes with at least ten and four samples per group for the Zhang and NGP nutlin simulations, respectively (Additional file 1: Figures S11 and S12). DESeq, PoissonSeq, and NOISeq showed better FDR control, but at a cost of severe TPR loss.

Among the seven edgeR pipelines, edgeR robust showed generally better performance for the Zhang data simulations (Additional file 1: Figure S15). However, only a small difference was observed in the simulation that starts with the less variable NGP nutlin data. edgeR robust with data-specific prior degrees of freedom seems more beneficial in maximizing the TPR. Only small performance variation was observed among the limma pipelines, except limmaQN, which deviated substantially (lower performance) in the second simulation (Additional file 1: Figure S16). This deviation may be due to the number of replicates, as only five samples were used in each group. Among the limma pipelines other than limmaQN, voom with sample quality weights (limmaVoom+QW) lost control of the FDR. Similarly, minor differences were observed among the DESeq2 pipelines (Additional file 1: Figure S17). However, as indicated in the concordance analysis, the independent filtering should be used carefully for lncRNAs. Similarly, among the QuasiSeq pipelines, the one with QL dispersion estimated independently for each gene appeared to have worse performance (Additional file 1: Figure S18).

The simulation study demonstrated that large heterogeneity among samples has a potential to negatively affect the performance of DE tools, particularly leading to a failure to detect biological signals. The heterogeneity can result from both biological and technical factors. The technical artefacts can be alleviated by filtering low quality or aberrant samples that substantially contribute to the intra-group variability [30]. Such samples can be recognized by the sample-to-sample distances projected into a two-dimensional space using, for example, principal component analysis [10, 32]. This is confirmed by an extra simulation that starts from the Zhang data whereby the most distant (outlying) samples were excluded beforehand (Additional file 1: Section 4.2.3). The results generally indicate that DE tools perform better with respect to FDR control and sensitivity if outlying samples are excluded (Additional file 1: Figures S19 and S20).

Methods for controlling the FDR, for example, Benjamini and Hochberg (BH) [31], rely on the assumption that the raw p values have a flat distribution near p = 1. This assumption, however, might not always hold, especially for low-abundance genes such as lncRNAs and for small numbers of replicates. This concern is demonstrated by (1) a simulation with no DE genes, so that all p values correspond to the null hypothesis, and (2) using the p values from the DE results from the six benchmark RNA-seq datasets. For comparison purposes, the p value distributions from the analysis of a simulated dataset with 30% DE genes are also included. The p values associated with the null hypotheses are supposed to be uniformly distributed between 0 and 1. For datasets with a fraction of SDE genes, a spike near p = 0 and a flat distribution near p = 1 are expected if the DE tool works properly. For many DE pipelines, the observed p value distribution looks as expected (Additional file 1: Figures S21–S27 and Additional file 2). When the number of replicates is small, a slightly conservative p value distribution (a spike near p = 1) is noticeable for lncRNAs, and to a lesser extent for mRNAs. The underlying cause may be the high variability of lncRNAs. This may result in loss of power to detect true DE lncRNAs, as confirmed by our simulation study. Correct calibration of p values under the null hypothesis and a large sample size can overcome this issue. Overall, QuasiSeq, DESeq, edgeR (exact test), and limma tools (for small numbers of replicates) return p values that do not satisfy the assumption of p value uniformity well.
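
For reference, the BH adjustment mentioned above can be written in a few lines. The sketch below is a plain implementation of the standard step-up procedure: each sorted p value is multiplied by the number of tests divided by its rank, and monotonicity is enforced from the largest p value downwards. The function name is an assumption, and missing p values (which some DE tools return) are not handled.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Because the multiplier depends on the total number of tests, the adjusted p value of a given gene changes with the set of genes analysed alongside it, for example when lncRNAs are adjusted jointly with mRNAs rather than on their own.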

False positive rate

The FPR refers to the probability of calling a gene SDE in a scenario with no DE genes at all. The FPR of the DE tools was evaluated using simulated RNA-seq data with 0% DE genes (also known as a mock comparison). Results shown in Additional file 1: Figure S28 demonstrate that all DE pipelines resulted in an FPR of less than 1%. The results were similar for both gene biotypes (mRNAs and lncRNAs), except for a slightly higher FPR for lncRNAs than for mRNAs. The FPR was generally larger for methods relying on the negative binomial distribution. This finding is in line with a previous comparative study [13], which concluded that the number of false predictions of differential expression from DE tools (most of which are also part of our study) is sufficiently low even for genes with low counts (the lowest 25% expressed genes).

Simulation of lncRNA expression data only

Results presented up to this point came from simulating, normalizing, and analyzing lncRNAs and mRNAs together. Of note, joint analysis of the two gene biotypes may affect results. For example, gene-specific dispersion parameters for negative binomial models are often estimated by sharing information across all genes using an empirical Bayes strategy [32,33,34], and hence the results for lncRNAs depend on mRNA read counts and vice versa. In addition, adjusted p values aimed at controlling FDR are calculated taking into account the total number of genes included in the analysis [31]. Therefore, we also evaluated the performance of the DE tools with only lncRNA data, using the same simulation procedures. Our conclusions remain the same. The results are shown in Additional file 1: Figure S29. The FDR control is generally worse when analyzing lncRNAs separately, particularly for small numbers of replicates. Only a small reduction in TPR is observed.

Web application

All simulation results can be consulted and visualized with a web application [35].

Conclusions

The discovery and study of lncRNAs are of major relevance to human health and disease because they represent an extensive, largely unexplored, and functional component of the genome [3,4,5]. Several gene expression studies indicated that the expression of the majority of lncRNAs is characterized by low abundance [2, 7, 9], high noise [8], and tissue-specific expression [7]. These characteristics are very challenging for DE tools and may negatively affect tool performance [10, 11]. Our study evaluated the performance of widely used statistical tools for testing DGE in RNA-seq data, with separate analysis of mRNA and lncRNA.

Under the assumption that there is no batch effect, all considered normalization methods perform equally well with respect to correcting the library size differences and reducing technical variability. However, QN tends to substantially deviate in terms of its effect on DGE analysis. This result was also confirmed by the poor performance of the limma DE pipeline that uses QN. Therefore, it is fair to suggest not to use QN to normalize RNA-seq data. Concordance analysis based on six diverse types of RNA-seq datasets demonstrated that the DE pipelines lack strong consensus on identifying a set of significantly differentially expressed genes, gene ranking, and fold-change estimates. For datasets with a small number of replicates and/or heterogeneous replicates, the disagreement is even worse. Lower concordance was observed generally for lncRNAs than for mRNAs. In particular, limmaQN, NOISeq, baySeq, and DESeq (also PoissonSeq and SAMSeq to a smaller extent) showed lower concordance with other DE tools. In contrast, edgeR, DESeq2, and limma (except limmaQN) tools showed better agreement with other DE tools and consistent characteristics across datasets. In terms of recovering known evidence of a biological truth, DESeq2 (both settings), edgeR (robust), and limma (voom+QW, voom, and trended) showed better capability than all other tools, whereas PoissonSeq, SAMSeq, NOISeq, DESeq, and QuasiSeq (both settings) showed inferior capability.

Results of the non-parametric simulation study revealed substantial differences between methods with respect to FDR control and sensitivity. DE tool performance is strongly affected by sample size and biological variability. For datasets with small biological variability, all methods control the FDR at the nominal level, even with only five biological replicates per condition. For more variable datasets, on the other hand, FDR control is only guaranteed for larger sample sizes, and only with the following methods: PoissonSeq, limma tools, SAMSeq, and DESeq. All other DE methods result in actual FDR levels far above the nominal level, with the FDR exceeding 50% for lncRNAs even with 40 replicates per condition. Differences between the tools in terms of sensitivity are not very large, except for PoissonSeq, NOISeq, DESeq, and (to a smaller extent) limmaQN, which showed the lowest sensitivities among all pipelines. For highly variable data, a sensitivity of 50% for mRNAs is reached only with at least 40 replicates per condition. For homogeneous samples, this level of sensitivity is reached with four to five replicates per condition. In addition, we have demonstrated that the performance of DE tools can be improved by filtering aberrant samples that substantially contribute to the intra-group variability.
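The two quantities compared in this paragraph can be made concrete with a minimal sketch. It assumes a simulated dataset in which the true DE status of every gene is known, and computes the observed proportion of false discoveries and the sensitivity for one run. The function name and the example values are hypothetical, and averaging the false discovery proportion over many simulation runs would be needed to estimate the actual FDR.

```python
def fdp_and_sensitivity(adjusted_pvalues, is_truly_de, nominal_fdr=0.05):
    """Return (false discovery proportion, sensitivity) for one simulated run.

    adjusted_pvalues: multiplicity-adjusted p-values, one per gene.
    is_truly_de: booleans giving the simulated ground truth per gene.
    nominal_fdr: threshold at which a gene is called DE.
    """
    called = [p <= nominal_fdr for p in adjusted_pvalues]
    true_pos = sum(1 for c, t in zip(called, is_truly_de) if c and t)
    false_pos = sum(1 for c, t in zip(called, is_truly_de) if c and not t)
    n_called = true_pos + false_pos
    n_true_de = sum(1 for t in is_truly_de if t)
    # By convention the proportion is 0 when no gene is called DE.
    fdp = false_pos / n_called if n_called else 0.0
    sensitivity = true_pos / n_true_de if n_true_de else float("nan")
    return fdp, sensitivity


# Hypothetical run: five genes, three of them truly DE.
adjusted = [0.01, 0.02, 0.04, 0.20, 0.60]
truth = [True, True, False, True, False]
fdp, sens = fdp_and_sensitivity(adjusted, truth, nominal_fdr=0.05)
```

In this toy run three genes are called, one of them falsely, so the observed false discovery proportion (one in three) is well above the 5% nominal level even though two of the three truly DE genes are recovered.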

In the interest of promoting reproducible science, it is essential to select a DE tool that succeeds in controlling the FDR level under a large range of conditions. Among the tools that do so, one can then select one with a high sensitivity. If a DE tool has an actual FDR far larger than the nominal level, many of the claimed discoveries will be false discoveries. If one accepts a large proportion of false discoveries, it is still generally better to use a DE tool with good FDR control, but to apply the method at a larger nominal FDR level. In this way, the researcher still controls the error rate. This reasoning implies that the selection of a DE tool should never rely purely on sensitivity. High sensitivities may be expected from methods with a large actual FDR (and hence poor FDR control). These high sensitivities are illusory in light of the large proportion of false discoveries.
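The idea of applying a well-controlled method at a larger nominal FDR level can be sketched as follows. The example assumes that the tool reports raw p-values and that the Benjamini-Hochberg procedure is used for multiplicity adjustment, which is a common choice but an assumption here. The p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def n_discoveries(adjusted_pvalues, nominal_fdr):
    """Number of genes called DE at the given nominal FDR level."""
    return sum(1 for p in adjusted_pvalues if p <= nominal_fdr)


# Hypothetical raw p-values for five genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(raw)
at_5_percent = n_discoveries(adj, 0.05)
at_10_percent = n_discoveries(adj, 0.10)
```

Raising the nominal level from 5% to 10% increases the number of discoveries in this toy example from two to four, while the expected proportion of false discoveries remains bounded by the chosen level, provided the method controls the FDR in the first place.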

Combining all results, we conclude that limma (with variance stabilizing transformation; voom with or without quality weighting; trend) and SAMSeq control the actual FDR reasonably well, while not sacrificing sensitivity. However, desirable performance is guaranteed only for a reasonably large number of replicates and for samples with low variability. Our results also indicate that accurate differential expression inference for lncRNAs requires more samples than for mRNAs. Although we concluded that DE tools exhibit substandard performance for lncRNAs, the same holds for low-abundance mRNAs.