 Short Report
 Open Access
 Published:
Exaggerated false positives by popular differential expression methods when analyzing human population samples
Genome Biology volumeÂ 23, ArticleÂ number:Â 79 (2022)
Abstract
When identifying differentially expressed genes between two conditions using human population RNAseq samples, we found a phenomenon by permutation analysis: two popular bioinformatics methods, DESeq2 and edgeR, have unexpectedly high false discovery rates. Expanding the analysis to limmavoom, NOISeq, dearseq, and Wilcoxon ranksum test, we found that FDR control is often failed except for the Wilcoxon ranksum test. Particularly, the actual FDRs of DESeq2 and edgeR sometimes exceed 20% when the target FDR is 5%. Based on these results, for populationlevel RNAseq studies with large sample sizes, we recommend the Wilcoxon ranksum test.
Background
RNAseq is a transcriptome profiling approach using deepsequencing technologies [1,2,3]. Since RNAseq was developed over a decade ago, it has become an indispensable tool for genomewide transcriptomic studies. One primary research task in these studies is the identification of differentially expressed genes (DEGs) between two conditions (e.g., tumor and normal samples) [3]. This taskâ€™s longstanding, core challenge is the small sample size, typically two or three replicates per condition. Many statistical methods have been developed to address this issue by making parametric distributional assumptions on RNAseq data, and the two most popular methods of this type are DESeq2 [4] and edgeR [5]. However, as sample sizes have become large in populationlevel RNAseq studies, where dozens to thousands of samples were collected from individuals [6, 7], a natural question to ask is whether DESeq2 and edgeR remain appropriate.
Results and discussion
To evaluate the performance of DESeq2 and edgeR on identifying DEGs between two conditions, we applied the two methods to 13 populationlevel RNAseq datasets with total sample sizes ranging from 100 to 1376 (Additional file 1: Table S1). We found that DESeq2 and edgeR had large discrepancies in the DEGs they identified on these datasets (Additional file 1: Fig. S1). In particular, 23.71â€“75% of the DEGs identified by DESeq2 were missed by edgeR. The most surprising result is from an immunotherapy dataset (including 51 prenivolumab and 58 onnivolumab antiPD1 therapy patients) [8]: DESeq2 and edgeR had only an 8% overlap in the DEGs they identified (DESeq2 and edgeR identified 144 and 319 DEGs, respectively, with a union of 427 DEGs but only 36 DEGs in common). This phenomenon raised a critical question: did DESeq2 and edgeR reliably control their false discovery rates (FDRs) to the target 5% on this dataset?
From the literature, we found that several studies had reported the anticonservative behavior of DESeq2 and edgeR [9,10,11]; however, they were restricted to using simulated datasets with small sample sizes. Hence, for our largesamplesize scenario, their findings did not provide a direct answer to our question. Nevertheless, large sample sizes allowed us to generate permuted datasets to evaluate the FDRs without relying on model assumptions.
To answer this question, we first generated 1000 negativecontrol datasets by randomly permuting the twocondition labels (prenivolumab and onnivolumab) of the 109 RNAseq samples in this immunotherapy dataset (Methods). Since any DEGs identified from these permuted datasets are considered as false positives, we used these permuted datasets to evaluate the FDRs of DESeq2 and edgeR. Surprisingly, DESeq2 and edgeR had 84.88% and 78.89% chances, respectively, to identity more DEGs from the permuted datasets than from the original dataset (Fig.Â 1A). In particular, DESeq2 and edgeR mistakenly identified 30 and 233 genes as DEGs, respectively, from 50% permuted datasets (Fig.Â 1B). Even more, among the 144 and 319 DEGs that DEseq2 and edgeR identified respectively from the original dataset, 22 (15.3%) and 194 (60.8%) DEGs were identified from at least 50% of permuted datasets, suggesting that these DEGs were spurious (Fig.Â 1C). These results raised the caution about exaggerated false positives foundÂ by DESeq2 and edgeR on the original dataset.
What is more counterintuitive, the genes with larger fold changes estimated by DESeq2 and edgeR (between the two conditions in the original dataset) were more likely to be identified as DEGs by the two methods from the permuted datasets (Fig.Â 1D and Additional file 1: Fig. S2). This finding is consistent with a recent paper, which also reported that selecting the genes with the largest estimated differences between the two conditions would inflate the FDR [12]. As biologists tend to believe that these largefoldchange genes are more likely true DEGs (which is not necessarily true because a dataset may contain no true DEGs at all), the fact that these genes are false positives would likelyÂ waste experimental validation efforts.
Out of curiosity and as a means of verification, we investigated the biological functions of the spurious DEGs identified by DESeq2 or edgeR from the permuted datasets. Unexpectedly, these spurious DEGs' top 5 enriched gene ontology (GO) terms included immunerelated terms (Fig.Â 1E). Hence, if these spurious DEGs were not removed by FDR control, they would mislead researchers to believe that there was an immune response difference between prenivolumab and onnivolumab patients, an undoubtedly undesirable consequence that DEG analysis must avoid.
Then, a question followed: why did DESeq2 and edgeR make so many falsepositive discoveries from this immunotherapy dataset? Our immediate hypothetical reason was the violation of the negative binomial model assumed by both DESeq2 and edgeR [13]. To check this hypothesis, we selected two groups of genes: (1) the genes identified as DEGs from â‰¥ 20% permuted datasets and (2) the genes identified as DEGs from â‰¤ 0.1% permuted datasets; then, we evaluated how well the negative binomial model fit the genes in each group. In line with our hypothesis, the model fitting was worse for the genes in the first group, consistent with the fact that these genes were spurious DEGs (Fig.Â 1F and Additional file 1: Fig. S3). Further checking of the spurious DEGs enriched in immunerelated GO terms revealed that these genes also had worse model fitting than those genes in the second group (Additional file 1: Fig. S4). Considering that a likely cause of the model violation is the existence of outliers, we examined all the genes that were mistakenly identified as DEGs in at least 10% of permuted datasets, and we detected the existence of outliers in all these genesâ€™ measurements relative to the assumed negative binomial model (Additional file 1: Fig. S5). It is well known that estimating the mean is not informative in the existence of outliers. However, in parametric methods like edgeR and DESeq2, the null hypothesis is that a gene has the same mean under the two conditions. Hence, it is expected that the testing result would be severely affected by the existence of outliers. In contrast, the Wilcoxon ranksum test is more robust to outliers due to its different null hypothesis: a geneâ€™s measurement under one condition has equal chances of being less or greater than its measurement under the other condition. That is, the Wilcoxon ranksum test concerns more about the ranks than the magnitudes of measurements, making it robust to outliers.
Motivated by these findings, we further benchmarked DESeq2 and edgeR along with four other representative DEG identification methods on this immunotherapy dataset and the other 12 populationlevel RNAseq datasets from two large consortia, the GenotypeTissue Expression (GTEx) project [7] and the Cancer Genome Atlas (TCGA) [6] (Additional file 1: Table S1). In particular, on GTEx datasets, differential expression analysis can be performed between two tissues or cell types; on TCGA datasets, differential expression analysis can be performed between two disease statuses or biological conditions. The four representative methods include two popular methods limmavoom [14, 15] and NOISeq [16], a new method dearseq [11] (which claimed to overcome the FDR inflation issue of DESeq2 and edgeR on largesamplesize data), and the classic Wilcoxon ranksum test [17]. Note that DESeq2, edgeR, and limmavoom are parametric methods that assume parametric models for data distribution, while NOISeq, dearseq, and the Wilcoxon ranksum test are nonparametric methods that are less restrictive but require large sample sizes to have good power. (Also note that the GTEx project used DESeq2 and NOISeq for DEG identification.) Using permutation analysis on these datasets, we found that DESeq2 and edgeR consistently showed exaggerated false positives (reflected by their actual FDRs far exceeding the target FDR thresholds) compared to the other four methods (Additional file 1: Figs. S6S17).
While the permutation analysis created true negatives (nonDEGs) to allow FDR evaluation, it did not allow the evaluation of DEG identification power, which would require true positives (DEGs) to be known. Hence, we generated 50 (identically and independently distributed) semisynthetic datasets with known true DEGs and nonDEGs from each of the 12 GTEx and TCGA datasets. Then, we used these semisynthetic datasets to evaluate the FDRs and power of the six DEG identification methods (Methods). In comparing 386 heart left ventricle samples and 372 atrial appendage samples in a GTEx dataset, only the Wilcoxon ranksum test consistently controlled the FDR under a range of thresholds from 0.001 to 5% (Fig.Â 2A). In contrast, the other five methods, especially DESeq2 and edgeR, failed to control the FDR consistently. Moreover, we compared the power of the six methods conditional on their actual FDRs (Methods). (Due to the tradeoff between FDR and power, power comparison is only valid when the actual FDRs are equal.) As shown in Fig.Â 2A, the Wilcoxon ranksum test outperformed the other five methods in terms of power.
Finally, to investigate how sample sizes would influence the performance of the six methods, we downsampled each semisynthetic dataset to obtain percondition sample sizes ranging from 2 to 100. Again, only the Wilcoxon ranksum test consistently controlled the FDR at all sample sizes (Fig.Â 2B, Additional file 1: Fig. S18). Granted, at the FDR threshold of 1%, the Wilcoxon ranksum test had almost no power when the percondition sample size was smaller than 8â€”an expected phenomenon for its nonparametric nature. However, when the percondition sample size exceeded 8, the Wilcoxon ranksum test achieved comparable or better power than the three parametric methods (DESeq2, edgeR, and limmavoom) and the new method dearseq, and it clearly outpowered NOIseq (Fig.Â 2B, Additional file 1: Fig. S18). Considering that the proportion of DEGs might affect the performance of DEG identification methods [10], we next generated semisynthetic datasets with five proportions of DEGs (1%, 3%, 5%, 9%, and 20%) and evaluated the performance of DEG identification methods accordingly. The results show that the Wilcoxon ranksum test consistently controlled the FDR and achieved the highest power (conditional on the actual FDRs) across all proportions of DEGs (Additional file 1: Fig. S19). In contrast, other methods consistently failed to control the FDR across all proportions. These observations were consistent across all 600 semisynthetic datasets (Additional file 1: Figs. S20S30).
The three parametric methodsâ€”DESeq2, edgeR, and limmaâ€”have long been dominant in transcriptomic studies. For example, the GTEx project, a consortium effort studying gene expression and regulation in normal human tissues, used DESeq2 coupled with NOISeq to find DEGs between tissues [18]; several studies applied edgeR or limma to TCGA RNAseq data to find DEGs between tumor and normal samples [19,20,21]; moreover, researchers used DESeq2 to detect DEGs between responders and nonresponders of the immunotherapy [8, 22]. However, while the three parametric methods were initially designed to address the smallsamplesize issue, these populationlevel studies had much larger sample sizes (at least dozens) and thus no longer needed restrictive parametric assumptions. Moreover, violation of parametric assumptions would lead to illbehaved pvalues and likely failed FDR control [23], an issue independent of the sample size.
Although several studies had evaluated the performance of various DEG identification methods before our study [9, 10, 24,25,26,27,28,29,30,31,32,33,34], they had been restricted to using simulated datasets with small sample sizes (Additional file 2). Unlike all previous studies, our study evaluated the FDRs of DEG identification methods by permuting real RNAseq datasets, without relying on any model assumptions. We could use real datasets for FDR evaluation because our datasets have large sample sizes, allowing us to generate a large number of permuted datasets to ensure accurate FDR estimation. In contrast, previous studies had focused on small sample sizes, a scenario where FDR cannot be reliably estimated by permutation because the number of possible permutations is too small; that is why they all had to use modelbased simulation for FDR estimation, leaving the doubt if their simulated data resembled real data.
Another novelty of our study is the recommendation of the classical Wilcoxon ranksum test, which had been ignored by all existing benchmark studies (Additional file 2). For the first time, we found that the Wilcoxon ranksum test consistently controlled the FDR and achieved good power for DEG identification from largesamplesize RNAseq data. Although the recently developed dearseq method [11] was claimed to overcome the inflated FDR issue of DESeq2 and edgeR, our evaluation shows that dearseq still had the issue and did not outperform the Wilcoxon ranksum test.
In summary, our study shows the superiority of the Wilcoxon ranksum test, a powerful and robust nonparametric test also known as the MannWhitney test developed in the 1940s [17, 35,36,37,38], for twocondition comparisons on largesamplesize RNAseq datasets. The Wilcoxon ranksum test is known to be powerful for skewed distributions, as is the case with gene expression counts measured by RNAseq. Our results also echo the importance of verifying FDR control by permutation analysis. Beyond RNAseq data analysis, our study suggests that, for populationlevel studies with large sample sizes, classic nonparametric statistical methods should be considered asÂ the baseline for data analysis and new method benchmarking.
We did not include the twosample t test or the Welchâ€™s correction as an alternative to the Wilcoxon ranksum test for three reasons. First, if a geneâ€™s two sets of observations under the two conditions (i.e., two samples) are from nonGaussian distributions (which is the case for RNAseq data), the t test is only valid when the two sample sizes are large and the central limit theorem holds (then the gene's two sample means are approximately Gaussian distributed); however, the central limit theorem only holds when the two samples are from distributions with finite second moments, excluding the possibility that samples may come from heavytailed distributions with infinite second moments [39]. Second, the t test is not invariant to nonlinear monotone transformations on the observations; for example, the two samples may have the same population mean on the original scale (i.e., the null hypothesis is true) but different population means on the log scale (i.e., the null hypothesis is false); there is no consensus on what transformation is optimal for RNAseq data. In contrast, the Wilcoxon ranksum test is invariant to monotone transformations. Third, the t test only concerns the mean difference between the two samplesâ€™ distributions, but the mean parameter is not informative if a distribution is heavily skewed, and estimating the mean parameter is not robust to the existence of outliers. Unlike the t test, the Wilcoxon ranksum test has no requirement on the data distributions. It concerns the null hypothesis that a random observation from one distribution has equal chances of being less or greater than a random observation from the other distribution, an informative null hypothesis to test against regardless of the skewness of distributions. Admittedly, the DE genes found by the Wilcoxon ranksum test may have distributional differences other than the mean difference between the two conditions [40], but we do not regard this as a notable drawback of the Wilcoxon ranksum test. The reason is that genes may still be of biological interest if they have the same mean but different spread under the two conditions.
Finally, we note that, unlike DESeq2, edgeR, limmavoom, and dearseq, the Wilcoxon ranksum test is a nonregressionbased method, making it unable to adjust for confounders. Hence, to use the Wilcoxon ranksum test for DEG identification, researchers can normalize RNAseq samples to remove batch effects or use the probabilistic index models to adjust the Wilcoxon ranksum test for covariates [41, 42].
Conclusions
In conclusion, when the percondition sample size is less than 8, parametric methods may be used because their power advantage may outweigh their possibly exaggerated false positives. However, if users are concerned about FDR control, our recent method Clipper provides a pvaluefree FDR control solution for smallsamplesize data [43]; for largesamplesize data, the Wilcoxon ranksum test is our recommended choice for its solid FDR control and good power.
Methods
Selection of DEG identification methods
We selected the three parametric methods for identifying differentially expressed gene (DEGs) from RNAseq data based on popularity: DESeq2 [4], edgeR [5], and limmavoom [15] (with 33,969, 24,037, and 15,786 citations, respectively, in Google Scholar as of 24 February 2022). We chose the nonparametric method NOISeq [16] because the GenotypeTissue Expression (GTEx) consortium used it to identify DEGs between tissues, and we used GTEx RNAseq datasets in our study. We also chose the newly developed nonparametric method dearseq [11], which claimed that it overcomes the FDR control issue of DESeq2, edgeR, and limmavoom on largesamplesize data. Moreover, we included the Wilcoxon ranksum test, a classical nonparametric statistical test that compares two samples (i.e., a geneâ€™s two sets of expression levels measured under two conditions).
Datasets
Since we focused on large samples, we chose RNAseq datasets from spotlight populationlevel studies (where samples are from healthy individuals or patients) including GTEx [7], TCGA [6], and an immunotherapy study [8]. We did not select datasets by any criteria other than the sample size (we requiredÂ at least 50 samples per condition). In particular, GTEx and TCGA are two consortia that generated largescale RNAseq datasets. On GTEx datasets, differential expression analysis can be performed between two tissues or cell types. On TCGA datasets, differential expression analysis can be performed between two disease statuses or biological conditions. The immunotherapy dataset represents an important research topic where differential expression is performed to understand the effectiveness of immunotherapy treatment.

For the immunotherapy study, we selected one dataset with a total sample size of 109, including 51 prenivolumab and 58 onnivolumab antiPD1 therapy melanoma samples (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE91061).

For TCGA data, we selected RNAseq datasets of six cancer types, which have paired normal tissues and sample sizes greater than 50 for both cancer and normal tissues. Then, we downloaded the gene read count matrices of these selected datasets from GDC Xena Hub (https://xenabrowser.net/datapages/?hub=https://gdc.xenahubs.net:443, release v18.0).

For GTEx data, we selected six pairs of tissues with sample sizes ranging from 126 to 706. Then we downloaded the gene read count matrices of these tissue samples from the GTEx Portal (https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_20170605_v8_RNASeQCv1.1.9_gene_reads.gct.gz, GTEx Analysis V8).
Additional file 1: Table S1 lists the detailed information of the datasets used in this study.
The identification of DEGs
All six methods (DESeq2, edgeR, limmavoom, NOISeq, dearseq, and the Wilcoxon ranksum test) took a read count matrix and a condition label vector as input. The parameters were set based on the user guides of these methodsâ€™ software packages.

For DESeq2 (v1.28.1), we used the DESeq function to perform differential analysis, followed by generating the results using the results function.

For edgeR (v3.30.3), we first filtered out genes with very low counts using the filterByExpr function, followed by normalization using the trimmed mean of M values (TMM) method. Then, the quasilikelihood Ftest was used for differential analysis.

For limmavoom (v3.44.3), we filtered genes and calculated the normalization factor in the same way as we did for edgeR. Then we applied the voom transformation to the normalized and filtered count matrix and performed the differential analysis using the lmFit and eBayes functions.

For NOISeq (v2.31.0), we used the noiseqbio function to identify DEGs.

For dearseq, the filtering and normalization steps were the same as those for edgeR. Then, we used the dear_seq function with the asymptotic test to identify DEGs.

For the Wilcoxon ranksum test, the filtering and normalization steps were the same as those for edgeR. For pvalue calculation, we input each geneâ€™s countspermillion (CPM) values into the wilcox.test function in R (v4.0.2). Then, we set a pvalue cutoff based on an FDR threshold using the Benjamini & Hochberg method.
Specifically, DEGs were selected based on the corresponding FDR threshold for all the six methods (FDRâ€‰<â€‰0.05 for the immunotherapy dataset; FDRâ€‰<â€‰0.01 for GTEx and TCGA datasets).
The generation of permuted and semisynthetic data from the original RNAseq data
From each original RNAseq dataset, we generated permuted datasets between two conditions (pretherapy and ontherapy samples for the immunotherapy data; two tissue types for GTEx data; normal and tumor samples for TCGA data). We used M to denote a genebysample read count matrix (with genes as rows and samples as columns) and C to denote the vector of sample conditions labels (corresponding to the columns of M). Then, we generated a permuted dataset by randomly permuting all values in C and keeping the original order of samples in M. We repeated this permutation procedure for 1000 times to generate 1000 permuted datasets.
The semisynthetic datasets were generated based on original RNAseq samples from GTEx and TCGA. We first used all six DEG identification methods to identify DEGs from each original dataset containing two conditions. We then defined true DEGs as the genes identified as DEGs by all six methods at a very small FDR threshold (0.0001%). We used X and Y to denote the read count matrices from the two conditions, and X_{i} and Y_{i} to denote the read counts of gene i from the two conditions (i.e., the ith row of X and Y). Then, we generated semisynthetic datasets X^{â€²} and Y^{â€²} in the following way: for each semisynthetic datasets, we first randomly sampled k selected true DEGs from all true DEGs and preserved the read counts of these selected true DEGs; for each of the remaining genes, we randomly permuted its read counts between the two conditions. That is, \({\boldsymbol{X}}_{\boldsymbol{i}}^{\prime }={\boldsymbol{X}}_{\boldsymbol{i}}\) and \({\boldsymbol{Y}}_{\boldsymbol{i}}^{\prime }={\boldsymbol{Y}}_{\boldsymbol{i}}\) if gene i is a selected true DEG, \(\left({\boldsymbol{X}}_{\boldsymbol{i}}^{\prime },{\boldsymbol{Y}}_{\boldsymbol{i}}^{\prime}\right)={\sigma}_i\left({\boldsymbol{X}}_{\boldsymbol{i}},{\boldsymbol{Y}}_{\boldsymbol{i}}\right)\) if gene i is not a selected true DEG, where Ïƒ_{i} is a random permutation of values in X_{i} and Y_{i}. We repeated this procedure independently for 50 times to generate 50 semisynthetic datasets. To generate downsampled semisynthetic datasets with a percondition sample size of n, we randomly sampled n columns from X^{â€²} and Y^{â€²} each. The number of selected true DEGs k is 50% of the number of all true DEGs in most results. For results in first 4 rows of Additional file 1: Fig. S19, we also variedÂ the percentages of selected true DEGs: 10%, 30%, and 90% of all true DEGs (1%, 3%, 9% of all genes). For the results in the last row of Additional file 1: Fig. S19, we defined selected true DEGs as the genes identified as DEGs by all six methods at FDR thresholdâ€‰=â€‰1%.
Calculation of FDR, power, and poorness of model fit
The FDR is defined as the expectation of the false discovery proportion (FDP), the proportion of false positives among all the discoveries. The FDR cannot be directly observed, but the FDP can be calculated from benchmark datasets with known true positives and negatives. In our semisynthetic data analysis, we defined true DEGs as true positives and the remaining genes as true negatives. First, we calculated the FDP of each DEG identification method (e.g., DESeq2) on each semisynthetic dataset. Second, we calculated the methodâ€™s (approximate) FDR by taking the average of its FDPs on the 50 semisynthetic datasets.
The power of a DEG identification method is defined as the probability of identifying a gene as a DEG conditional on that the gene is a true DEG. It can also be considered as the expectation of the empirical power, which is the proportion of true DEGs being identified as DEGs. In our semisynthetic datasets with true DEGs and true nonDEGs, we calculated the power of a method by taking the average of its empirical power on the 50 semisynthetic datasets, similar to how we calculated the FDR.
We used the goodnessoffit test to evaluate how well a geneâ€™s read counts under a condition can be fit by the negative binomial models estimated by DESeq2 and edgeR. To remove batch effects, we used the normalized read counts output by DESeq2 and edgeR. For each gene under each condition and each method (DESeq2 or edgeR), we conducted the goodnessoffit test on the methodâ€™s normalized read counts, with the methodâ€™s estimated dispersion parameter of the negative binomial distribution. The goodnessoffit test was implemented using the function goodfit in the R package vcd asÂ
summary(goodfit(round(normalized_counts), type = "nbinomial", par = list(size = 1/dispersion)))
which returns a pvalue. A smaller pvalue indicates a poorer fit. Hence, we defined the poorness of fit as the negative log_{10Â }(pvalue).
Availability of data and materials
All the source code and permuted and semisynthetic datasets used to generated results can be found at Zenodo [44].
All the codes used to generate results can be found at GitHub via URL https://github.com/xihuimeijing/DEGs_Analysis_FDR [45].
A tutorial for identifying DEGs using the Wilcoxon ranksum test can be found at https://rpubs.com/LiYumei/806213.
References
Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320:1344â€“9.
Wang Z, Gerstein M, Snyder M. RNASeq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57â€“63.
Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years. Nat Rev Genet. 2019;20:631â€“56.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014;15:550.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139â€“40.
Cancer Genome Atlas Research N, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. The Cancer Genome Atlas PanCancer analysis project. Nat Genet. 2013;45:1113â€“20.
Consortium GT. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318â€“30.
Riaz N, Havel JJ, Makarov V, Desrichard A, Urba WJ, Sims JS, et al. Tumor and microenvironment evolution during immunotherapy with nivolumab. Cell. 2017;171:934â€“949 e916.
Schurch NJ, Schofield P, Gierlinski M, Cole C, Sherstnev A, Singh V, et al. How many biological replicates are needed in an RNAseq experiment and which differential expression tool should you use? RNA. 2016;22:839â€“51.
Corchete LA, Rojas EA, AlonsoLopez D, De Las RJ, Gutierrez NC, Burguillo FJ. Systematic comparison and assessment of RNAseq procedures for gene expression quantitative analysis. Sci Rep. 2020;10:19737.
Gauthier M, Agniel D, Thiebaut R, Hejblum BP. dearseq: a variance component score test for RNAseq differential analysis that effectively controls the false discovery rate. NAR Genom Bioinform. 2020;2:lqaa093.
Ebrahimpoor M, Goeman JJ. Inflated false discovery rate due to volcano plots: problem and solutions. Brief Bioinform. 2021;22:bbab053.
Hawinkel S, Rayner JCW, Bijnens L, Thas O. Sequence count data are poorly fit by the negative binomial distribution. PLoS One. 2020;15:e0224909.
Law CW, Chen Y, Shi W, Smyth GK. voom: precision weights unlock linear model analysis tools for RNAseq read counts. Genome Biol. 2014;15:R29.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNAsequencing and microarray studies. Nucleic Acids Res. 2015;43:e47.
Tarazona S, FurioTari P, Turra D, Pietro AD, Nueda MJ, Ferrer A, et al. Data quality aware analysis of differential expression in RNAseq with NOISeq R/Bioc package. Nucleic Acids Res. 2015;43:e140.
Wilcoxon F. Individual comparisons of grouped data by ranking methods. J Econ Entomol. 1946;39:269.
Mele M, Ferreira PG, Reverter F, DeLuca DS, Monlong J, Sammeth M, et al. The human transcriptome across tissues and individuals. Science. 2015;348:660â€“5.
Peng L, Bian XW, Li DK, Xu C, Wang GM, Xia QY, et al. Largescale RNASeq transcriptome analysis of 4043 cancers and 548 normal tissue controls across 12 TCGA cancer types. Sci Rep. 2015;5:13413.
Tang Z, Li C, Kang B, Gao G, Li C, Zhang Z. GEPIA: a web server for cancer and normal gene expression profiling and interactive analyses. Nucleic Acids Res. 2017;45:W98â€“W102.
Rosario SR, Long MD, Affronti HC, Rowsam AM, Eng KH, Smiraglia DJ. Pancancer analysis of transcriptional metabolic dysregulation using The Cancer Genome Atlas. Nat Commun. 2018;9:5330.
Gide TN, Quek C, Menzies AM, Tasker AT, Shang P, Holst J, et al. Distinct immune cell populations define response to antiPD1 monotherapy and antiPD1/antiCTLA4 combined therapy. Cancer Cell. 2019;35:238â€“255 e236.
Benjamini Y, Hochberg Y. Controlling the false discovery rate  a practical and powerful approach to multiple testing. J R Stat Soc Ser B Stat Methodol. 1995;57:289â€“300.
Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNASeq experiments. BMC Bioinformatics. 2010;11:94.
Kvam VM, Liu P, Si Y. A comparison of statistical methods for detecting differentially expressed genes from RNAseq data. Am J Bot. 2012;99:248â€“56.
Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, et al. Comprehensive evaluation of differential gene expression analysis methods for RNAseq data. Genome Biol. 2013;14:R95.
Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNAseq data. BMC Bioinformatics. 2013;14:91.
Zhang ZH, Jhaveri DJ, Marshall VM, Bauer DC, Edson J, Narayanan RK, et al. A comparative study of techniques for differential expression analysis on RNASeq data. PLoS One. 2014;9:e103207.
Seyednasrollah F, Laiho A, Elo LL. Comparison of software packages for detecting differential expression in RNAseq studies. Brief Bioinform. 2015;16:59â€“70.
CostaSilva J, Domingues D, Lopes FM. RNASeq differential expression analysis: an extended review and a software tool. PLoS One. 2017;12:e0190152.
Williams CR, Baccarella A, Parrish JZ, Kim CC. Empirical assessment of analysis workflows for differential expression analysis of human samples using RNASeq. BMC Bioinformatics. 2017;18:38.
Quinn TP, Crowley TM, Richardson MF. Benchmarking differential expression analysis tools for RNASeq: normalizationbased vs. logratio transformationbased methods. BMC Bioinformatics. 2018;19(274).
Baik B, Yoon S, Nam D. Benchmarking RNAseq differential expression analysis methods using spikein and simulation data. PLoS One. 2020;15:e0232271.
Li X, Cooper NGF, O'Toole TE, Rouchka EC. Choice of library size normalization and statistical methods for differential gene expression analysis in balanced twogroup comparisons for RNAseq studies. BMC Genomics. 2020;21:75.
Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18:50â€“60.
Hodges JL, Lehmann EL. The efficiency of some nonparametric competitors of the ttest. Ann Math Stat. 1956;27:324â€“35.
Chernoff H, Savage IR. Asymptotic normality and efficiency of certain nonparametric test statistics. Ann Math Statist. 1958;29:972â€“94.
Fay MP, Proschan MA. WilcoxonMannWhitney or ttest? On assumptions for hypothesis tests and multiple interpretations of decision rules. Stat Surv. 2010;4:1â€“39.
A generalized central limit theorem. Wikipedia. 2022, https://en.wikipedia.org/wiki/Stable_distribution#A_generalized_central_limit_theorem.
Fagerland MW. ttests, nonparametric tests, and large studiesa paradox of statistical practice? BMC Med Res Methodol. 2012;12:78.
Thas O, Neve JD, Clement L, Ottoy JP. Probabilistic index models. J R Stat Soc Ser B Stat Methodol. 2012;74:623â€“71.
De Neve J, Thas O, Ottoy JP, Clement L. An extension of the WilcoxonMannWhitney test for analyzing RTqPCR data. Stat Appl Genet Mol Biol. 2013;12:333â€“46.
Ge X, Chen YE, Song D, McDermott M, Woyshner K, Manousopoulou A, et al. Clipper: pvaluefree FDR control on highthroughput data from two conditions. Genome Biol. 2021;22:288.
Li Y, Ge X. Processed datasets for differential expression analysis on polulationlevel RNAseq data. Zenodo. 2022; https://doi.org/10.5281/zenodo.5241320.
Li Y, Ge X. Exaggerated false positives by popular differential expression methods when analyzing human population samples. Github. 2022; https://github.com/xihuimeijing/DEGs_Analysis_FDR.
Acknowledgements
We thank Jason Sheng Li of Wei Li lab for suggestions on the title. We also thank other members of Wei Li lab and Jingyi Jessica Li lab for helpful discussions.
Review history
The review history is available as Additional file 3.
Peer review information
Barbara Cheifet and Stephanie McClelland were the primary editors of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Funding
This work was supported by the following grants: The U.S. National Institutes of Health R01CA193466 and R01CA228140 (to W.L.); NIH/NIGMS R01GM120507 and R35GM140888, NSF DBI1846216 and DMS2113754, Johnson & Johnson WiSTEM2D Award, Sloan Research Fellowship, and UCLA David Geffen School of Medicine W.M. Keck Foundation Junior Faculty Award (to J.J.L.).
Author information
Authors and Affiliations
Contributions
Y.L., W.L., and J.J.L. conceived and supervised this project. Y.L. and X.Z. performed the data analysis. Y.L., X.Z., F.P., J.J.L, and W.L. interpreted the data and wrote the manuscript. The authors read and approved the final manuscript.
Authorsâ€™ information
Twitter handles: Wei Li (@superweili) and Jingyi Jessica Li (@jsb_ucla)
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing financial interests.
Additional information
Publisherâ€™s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Figs. S1 to S30 and Table S1.
Additional file 2.
Summary of studies comparing DEG analysis methods.
Additional file 3.
Review history.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Li, Y., Ge, X., Peng, F. et al. Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biol 23, 79 (2022). https://doi.org/10.1186/s13059022026484
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13059022026484