A practical guide to methods controlling false discoveries in computational biology

Background
In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigate the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology.

Results
Methods that incorporate informative covariates are modestly more powerful than classic approaches, and do not underperform classic approaches, even when the covariate is completely uninformative. The majority of methods are successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we find that the improvement of the modern FDR methods over the classic methods increases with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses.

Conclusions
Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.

Electronic supplementary material
The online version of this article (10.1186/s13059-019-1716-1) contains supplementary material, which is available to authorized users.

setting due to distributional assumptions of the methods (see Figure 1). Method is denoted by color, and the mean value across 100 replications with standard error bars is shown in (B)-(E).

Figure S11 Gain from informative covariate varies by case study. (A) For each method (y-axis) that uses an informative covariate, the percent change in rejections when using an informative covariate as compared to using a completely uninformative covariate is represented by color. This is defined as the number of rejections when using the informative covariate divided by the number of rejections when using the uninformative (random) covariate, multiplied by 100. This value is averaged across all datasets and informative covariates used in each case study (x-axis). If rejections were found using the informative covariate but none were found using the uninformative covariate, the percentage was set to 100%. The maximum value of this percentage in each column is labeled. (B) For each method (y-axis) that uses an informative covariate, the absolute percentage change in FDR (left) and TPR (right) in the yeast in silico experiments is represented by color (setting with 5 samples in each group, unimodal effect size distribution, and 30% non-null genes). Results are averaged over 100 simulation replications. [Panel (A): methods (y-axis) by case study (x-axis: RNAseq, ChIPseq, GSEA, microbiome, scRNAseq); color scale: % rejected relative to maximum, 0-100%.]
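The gain metric plotted in this figure is a simple ratio; a minimal Python sketch (the function name and the convention of returning 0 when neither covariate yields rejections are our assumptions):

```python
def rejection_gain(n_informative, n_uninformative):
    """Percent of rejections with an informative covariate relative to an
    uninformative (random) covariate, as plotted in Figures S11 and S17.

    Following the caption: if the informative covariate yields rejections
    but the uninformative one yields none, the value is set to 100%.
    (Returning 0.0 when both counts are zero is our assumption.)
    """
    if n_uninformative == 0:
        return 100.0 if n_informative > 0 else 0.0
    return 100.0 * n_informative / n_uninformative

# 130 rejections with the informative covariate vs 100 without -> 130.0
print(rejection_gain(130, 100))
```

Values above 100% therefore indicate a gain from the informative covariate, and values below 100% indicate a loss.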

Figure S17 Gain from informative covariate varies by dataset and covariate in case studies. For each method (y-axis) that uses an informative covariate, the percent change in rejections when using an informative covariate as compared to using a completely uninformative covariate is represented by color. This is defined as the number of rejections when using the informative covariate divided by the number of rejections when using the uninformative (random) covariate, multiplied by 100. This is shown separately for each dataset (grouped by case study and informative covariate, x-axis). If rejections were found using the informative covariate but none were found using the uninformative covariate, the percentage was set to 100%. The maximum value of this percentage in each column is labeled. The informative covariate used in each case study is listed in Table 1. For case studies with more than one covariate, the covariate is denoted in the column labels.

Input: (1) p-values; package: none [4]. Modifies the standard two-group model by assuming known group structure and using different cutoffs for each group. FDR is controlled at different rates for each group to minimize the global false nondiscovery rate (FNR) subject to a constraint on the global FDR.

AdaPT (adapt-glm) [10].

FDR Regression (FDRreg), empirical [13]. Input: (1) z-scores.

Adaptive Shrinkage (ASH) [2]. Input: (1) effect sizes; package: ashr; function(s): ash, get_qvalue. Introduces the concept of the local false sign rate and s-values for controlling errors across multiple tests. The same approach can also be used to compute q-values and the local false discovery rate. Assumes that the distribution of effect sizes is unimodal.

[1] Formally, the BH approach does not generate adjusted p-values, but instead provides significance calls at a specified α FDR cutoff. Adjusted p-values are commonly computed as the smallest FDR cutoffs at which each test would be called significant.
[2] The q-value is defined as the positive FDR (pFDR) analogue of the p-value. The approach can also be used to compute the local false discovery rate.
[3] Requires specification of a degrees of freedom parameter. Requires multiplying weights against BH adjusted p-values to obtain adjusted p-values.
[4] Custom implementation using the fdrtool package (function: fdrtool). Requires manually specifying covariate groups and computing the local fdr with the fdrtool package, followed by custom code.
[5] The q-value is defined as the minimum target FDR level such that the p-value is rejected. For hypotheses with p-values above the initial threshold value (default 0.45), the q-values are set to ∞ because they are never rejected by adapt-glm for any α.
[6] The Barber-Candès procedure estimates the false discovery proportion by quantifying the asymmetry in the distribution of p-values at a given threshold.
[7] adapt_glm, adapt_gam, and adapt_glmnet are all wrappers around adapt that encode a specific assumption about the relationship between π(x) and µ(x). Requires manually specifying the models for these relationships with, e.g., the splines package.
[8] Requires manually specifying the model matrix with, e.g., the splines package.
[9] The method also returns local false sign rates, local false discovery rates, and s-values, where s-values are defined analogously to Storey's q-value, but with the local false sign rate rather than the local false discovery rate. Since the aim of our analysis was to compare methods for controlling the FDR, we only report results for the estimated q-values.
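Footnote [1]'s definition of adjusted p-values (the smallest FDR cutoff at which each test would be called significant) can be made concrete; a minimal NumPy sketch of the BH adjustment, not the reference implementation (which in R is p.adjust with method="BH"):

```python
import numpy as np

def bh_adjust(pvals):
    """BH adjusted p-values: for each hypothesis, the smallest FDR cutoff
    alpha at which it would be called significant (cf. footnote [1])."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # raw BH quantities p_(i) * m / i for the sorted p-values
    raw = p[order] * m / np.arange(1, m + 1)
    # enforce monotone non-decreasing adjusted values via a reverse running min
    adj = np.minimum.accumulate(raw[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

# all five tests are rejected together at alpha = 0.05
print(bh_adjust([0.01, 0.02, 0.03, 0.04, 0.05]))
```

The reverse running minimum is what makes the adjusted values monotone in the ranked p-values, so thresholding them at α reproduces the step-up rejection set.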

Table S2
Yeast in silico experiment settings. The results from each series of simulations are reported as a separate Additional file in the supplementary materials, with the exception of the Null series, which is combined with the 'Unimodal Alternative, High π 0 ' series. Both the 'Null' and 'Unimodal Alternative, High π 0 ' series are also evaluated in the Polyester in silico experiments, and these results are provided as a separate file.

Series | Non-null Effect Size Distribution [11] | Non-null Genes N (%) [12] | Covariate Strength [10] | Figures | Additional File
Unimodal Alternative, High π 0 | Unimodal | 2000 (≈30%) | Strong | S1, S15, S11, S3, S10 | 2
— | Bimodal | 500 (≈7.5%) | Strong | S3, S10 | 5
— | — | — | Weak | — | —
— | — | — | Uninformative | — | —

[10] In all cases, non-null genes were selected using probability weights sampled from a logistic function (where weights w(u) = 1/(1 + e^(−10u+5)), and u ∼ U(0, 1)). The strongly informative covariate Xs was equal to the logistic sampling weight w. The weakly informative covariate Xw was equal to the logistic sampling weight plus noise: w + ε, where ε ∼ N(0, 0.25), truncated such that Xw ∈ (0, 1). The uninformative covariate Xu was unrelated to the sampling weights and drawn from a uniform distribution such that Xu ∼ U(0, 1).
[11] For unimodal alternative effect size distributions, the observed fold changes for the selected non-null genes in a non-null empirical comparison were used. For bimodal alternatives, observed test statistics z from an empirical non-null comparison were sampled with probability weights w(z) = f(|z|; α, β), where f is the Gamma probability density function (with shape and rate parameters α = 4.5 and β = 1 − 1e−4, respectively). The corresponding effect sizes (fold changes, FC) for ashq were calculated assuming a fixed standard error: FC = zσm, where σm is the median standard error of the log2 fold change across all genes.
[12] The total number of genes considered (with mean expression across all samples greater than 1 raw count) is 6553. A small number of genes are removed from each replicate if DESeq2 does not return a p-value.
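The covariate construction in footnote [10] can be sketched in a few lines of Python; np.clip stands in for the truncation, and we read N(0, 0.25) as a standard deviation of 0.25 (both are assumptions of this sketch):

```python
import numpy as np

def simulate_covariates(n_genes, n_nonnull, rng):
    """Draw non-null gene indices and the three covariates of footnote [10]."""
    u = rng.uniform(0.0, 1.0, n_genes)
    w = 1.0 / (1.0 + np.exp(-10.0 * u + 5.0))    # logistic sampling weights w(u)
    nonnull = rng.choice(n_genes, size=n_nonnull, replace=False, p=w / w.sum())
    x_strong = w                                  # strongly informative: Xs = w
    eps = rng.normal(0.0, 0.25, n_genes)          # noise for the weak covariate
    x_weak = np.clip(w + eps, 1e-9, 1 - 1e-9)     # Xw = w + eps, kept in (0, 1)
    x_uninf = rng.uniform(0.0, 1.0, n_genes)      # uninformative: Xu ~ U(0, 1)
    return nonnull, x_strong, x_weak, x_uninf

rng = np.random.default_rng(0)
nonnull, x_strong, x_weak, x_uninf = simulate_covariates(6553, 2000, rng)
# non-null genes carry systematically larger values of the strong covariate
print(x_strong[nonnull].mean() > x_strong.mean())
```

Because non-null genes are sampled with probability proportional to w, Xs is informative by construction, Xw is a noisy version of it, and Xu carries no signal at all.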

Supplementary in silico experiment results
Here we summarize the benchmarking results of fdrreg-e, which was excluded from the main results due to its unstable and often inferior performance compared to its counterpart fdrreg-t. The difference between these two implementations of FDRreg is that fdrreg-t assumes the null distribution of test statistics is standard normal, while fdrreg-e estimates the null distribution of test statistics empirically. We find this estimation procedure to be particularly sensitive to the distribution of effect sizes and the proportion of non-null tests.

Summary of fdrreg-e performance
We found that while modern FDR methods generally led to a higher true positive rate (TPR), or power, in the in silico experiments and simulations, fdrreg-e was sometimes as conservative as the Bonferroni correction (Figure S3). In several other simulation settings, fdrreg-e showed substantial gains in TPR over the other modern methods (Figures S3, S4, S5, S6, S7, S8). However, these gains were often accompanied by a lack of FDR control, highlighting the sensitivity of fdrreg-e to underlying model assumptions.

Number of tests
We observed that the FDR control of fdrreg-e was sensitive to the number of tests in simulation. Specifically, the FDR was substantially inflated when fdrreg-e was applied to fewer than 1,000 tests (Figure S4C). FDR control generally improved as the number of tests increased.

Proportion of non-null tests
The performance of fdrreg-e was particularly sensitive to extreme changes in the proportion of non-null tests.
In simulations, fdrreg-e exhibited inflated FDR when the proportion of non-null hypotheses was near 50% (Figure S4E), and suffered from low TPR when there were more than 20% non-null hypotheses, excluding settings where the FDR was not controlled (Figure S4F). In the yeast in silico experiments, we also observed that fdrreg-e was more conservative when the proportion of non-null genes was 30% than when it was 7.5% (Figure S3). Similar results were observed in a series of simulations with unimodal effect sizes when the proportion of non-null tests was increased from 10% (Figure S7) to 25% (Figure S8).

Distribution of test statistics
We observed that the performance of fdrreg-e declined when the normality assumption of the test statistic was violated (Figure S6B-C). The FDR was considerably inflated when fdrreg-e was applied to t-distributed test statistics. As expected, the increase in FDR was greater for the heavier-tailed t distribution with fewer degrees of freedom (Figure S6B).

Distribution of effect sizes
In addition to distributional assumptions on the test statistic, empirical FDRreg requires distributional assumptions on the effect size. Specifically, the empirical null framework used in fdrreg-e relies on [14] to estimate the distribution of null test statistics, which requires that all test statistics with values near zero are null, referred to as the 'zero assumption'. If this does not hold, as is the case when the effect sizes are unimodal, the null distribution is unidentifiable and its estimate may become overly wide, resulting in conservative behavior.
To investigate the sensitivity of these methods to the distribution of effect sizes, multiple effect size distributions were considered in both the yeast in silico experiments and the simulations. Both unimodal effect size distributions and distributions following the assumption of fdrreg-e, with most non-null effects greater than zero (Figure S6A), were considered. While most simulations included the latter, simulations were also performed with a set of unimodal effect size distributions described in [2] (Figures S7 and S8). In the yeast in silico experiments, two conditions were investigated: a unimodal and a bimodal case.
As expected, we observed that when the zero assumption of empirical FDRreg is violated, fdrreg-e was more conservative in both the yeast in silico experiments (Figure S3) and in simulation (Figures S7 and S8).
We also note that while it is simple to check distributional assumptions on the overall distribution of test statistics or effect sizes, in practice it is impossible to check the distributional assumptions of empirical FDRreg under the alternative, since they rely on knowing which tests are non-null.
To illustrate what types of covariates may be informative in controlling the FDR in different computational biology contexts, we compared the methods using six case studies including genome-wide association testing (Section 2.1), gene set analysis (Section 2.2), detecting differentially expressed genes in bulk RNA-seq (Section 2.3) and single-cell RNA-seq (Section 2.4), differential binding in ChIP-seq (Section 2.5), and differential abundance testing in the microbiome (Section 2.6). Here we provide additional results for each case study to complement the summary provided in the main text. For full details of the analyses and results, refer to Additional files 21-41.

Case-study: Genome-Wide Association Studies
Genome-Wide Association Studies (GWAS) are typically carried out on large cohorts of independent subjects in order to test for association of individual genetic variants with a phenotype. The genetic variants are generally measured using microarrays containing probes for up to several million Single Nucleotide Polymorphisms (SNPs). These SNP probes target single base-pair DNA sites that have been shown to vary across a population. To boost power, meta-analyses of GWAS group together many studies, commonly including hundreds of thousands to millions of SNPs, with heterogeneous effect sizes and a wide range of sample sizes at each locus.
We analyzed a GWAS experiment that carried out a meta-analysis of hundreds of thousands of individuals for more than two million SNPs for association of genetic variants with Body Mass Index (BMI) [15]. As informative covariates, we considered (1) the minor allele frequency (MAF), or the proportion of the population which exhibits the less common allele, and (2) the number of samples for which each SNP was tested for association in the corresponding meta-analysis. In total, 196,969 approximately independent SNPs (out of 2,456,142) were included in the FDR analysis.
For each covariate, we examined whether its rank was associated with the p-value distribution. As expected, larger values of the sample size resulted in an enrichment for smaller p-values. Additionally, intermediate values of the MAF were associated with an enrichment for smaller p-values. This is expected since an MAF near 0.5 balances the number of samples with each allele, thereby maximizing power to detect a difference. For both covariates, the distribution of moderate to large p-values appeared uniform and independent of the value of the covariate. For methods that include a covariate, similar numbers of SNPs were rejected at the 0.05 level for either covariate.
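The observation that intermediate MAF maximizes power can be made concrete: under Hardy-Weinberg equilibrium, the variance of an additively coded genotype is 2p(1 − p), which scales the per-sample information available to an association test and peaks at p = 0.5. A quick numeric check (the function is ours, for illustration):

```python
def genotype_variance(maf):
    """Variance of a 0/1/2 minor-allele-count genotype under
    Hardy-Weinberg equilibrium: Var(G) = 2 * p * (1 - p)."""
    return 2.0 * maf * (1.0 - maf)

# variance (and hence per-sample information) grows toward MAF = 0.5
for maf in (0.01, 0.05, 0.25, 0.5):
    print(maf, genotype_variance(maf))
```

This is why an MAF near 0.5, which balances the number of samples carrying each allele, maximizes power, and why MAF can serve as an informative covariate.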
For both informative covariates, we found lfdr, ihw, and fdrreg-e rejected the largest number of hypotheses, followed by fdrreg-t. The sample size covariate appeared to be more informative than MAF for lfdr and ihw, as both methods rejected more than ashq, whereas ashq found more discoveries than all covariate-aware methods that used MAF. Neither covariate seemed to be very informative for bl, as it did not have much gain over bh or qvalue. adapt-glm was more conservative than Bonferroni with the MAF covariate, but was ranked above bl using the sample size covariate. The overlap among the methods was high, with the largest set sizes containing SNPs rejected by all methods except Bonferroni and/or adapt-glm for both covariate comparisons. The next largest set size was the SNPs rejected by ashq exclusively.

Case-study: Gene set analyses
Gene set analysis is commonly used to provide insight into the results of differential expression analysis. These methods aim to identify gene sets, such as Gene Ontology (GO) categories or biological pathways, that exhibit a pattern of differential expression. One class of methods, called overrepresentation approaches, tests each gene set for a higher number of differentially expressed genes than expected by chance [16]. Another class of methods, called functional class scoring approaches, tests each gene set for a coordinated change in expression [16]. While the former operates on a list of differentially expressed genes and does not consider the magnitude or direction of effect, the latter uses information from all genes, and can even detect small coordinated changes across many genes that are not significantly differentially expressed individually. We investigated the use of an informative covariate in GOseq [17], an overrepresentation test, as well as Gene Set Enrichment Analysis (GSEA) [18], a functional class scoring approach. Since the sizes of gene sets differ substantially and these size differences translate into differences in power, we hypothesized that multiple-testing correction in gene set analysis would benefit from methods that incorporate information about set sizes.
We used two RNA-seq datasets that investigated gene expression changes (1) between cerebellum and cerebral cortex [19] and (2) upon differentiation of hematopoietic stem cells (HSCs) into multipotent progenitors (MPP) [20]. We obtained 9,853 and 1,336 differentially expressed genes with FDR below 0.10 (using BH) for the human and mouse datasets, respectively. We observed that for both GSEA and GOseq, larger gene sets were more likely to have smaller p-values than smaller gene sets; thus, the covariate was informative. In addition, the covariate appeared to be independent under the null hypothesis for GSEA, as evaluated by the histogram of p-values stratified by gene set size bins. However, upon evaluation of the stratified histograms of GOseq p-values, we observed that the distribution of p-values in the larger range was quite different for different covariate bins. This suggests that gene set size is not independent under the null hypothesis for GOseq, so the assumptions of methods which use an independent covariate are violated. Thus, we do not include the GOseq method in the benchmarking study and instead proceed with GSEA p-values only.
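The stratified p-value histograms used as a diagnostic here (and throughout the case studies) can be sketched as follows; function and parameter names are ours:

```python
import numpy as np

def stratified_pvalue_counts(pvals, covariate, n_strata=4, n_bins=10):
    """Split hypotheses into covariate quantile strata and histogram the
    p-values within each stratum. For a covariate that is independent of
    the p-values under the null, the moderate-to-large p-values should
    look roughly uniform in every stratum; systematic departures flag a
    violated independence assumption."""
    pvals = np.asarray(pvals, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    edges = np.quantile(covariate, np.linspace(0.0, 1.0, n_strata + 1))
    # assign each hypothesis to a stratum using the interior quantile edges
    strata = np.searchsorted(edges[1:-1], covariate, side="right")
    return np.stack([
        np.histogram(pvals[strata == s], bins=n_bins, range=(0.0, 1.0))[0]
        for s in range(n_strata)
    ])

rng = np.random.default_rng(2)
p = rng.uniform(size=4000)   # null p-values
x = rng.normal(size=4000)    # independent covariate
counts = stratified_pvalue_counts(p, x)
print(counts.shape)          # one row of bin counts per covariate stratum
```

With an independent covariate, each row of counts is close to flat; a covariate like gene set size for GOseq would instead produce rows with visibly different shapes in the large-p-value bins.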
In this case study, we excluded the methods fdrreg-e, fdrreg-t, and ashq since they require standard errors and test statistics that GSEA does not provide. Overall, lfdr, bl, and ihw rejected more hypotheses compared to the other methods. However, the ranking among these methods was not the same between the different datasets. For the mouse dataset, lfdr found the most rejections at smaller α levels (less than 0.05), but adapt-glm found the most at higher α levels. This was followed by bl, qvalue, and then ihw. For the human dataset, lfdr found the most rejections at all α levels, followed by ihw and then bl, and adapt-glm was more conservative than bh for almost all α levels. As expected, performance using the random (uninformative) covariate of bl and ihw was almost identical to qvalue and bh, respectively. However, adapt-glm using the uninformative covariate behaved quite differently in the two datasets, with no rejections in the human dataset, and more rejections than any other method in the mouse dataset (at α > 0.05).

Case-study: Differential gene expression in bulk RNA-seq
High-throughput sequencing of mRNA molecules has become the standard for transcriptome profiling. A central analysis task is to determine which genes are differentially expressed between two biological conditions. Statistical models have been established to address this question, including DESeq2 [21] and edgeR [22]. These methods return per gene p-values that are further adjusted for multiple testing, typically using the Benjamini-Hochberg procedure.
We assessed the performance of modern FDR methods in the context of differential expression on two RNAseq datasets. The first dataset consisted of two tissues of 10 individuals from the GTEx project and the second dataset consisted of a mouse knockdown experiment of the microRNA mir200c. For FDR methods that can use an informative covariate, we used mean expression across samples. We confirmed that this covariate was indeed informative for both datasets.
For the GTEx dataset, ashq found more rejections than any other method. At an FDR of 10%, the number of rejections of ashq was more than twice that of any of the rest of the methods, and the largest gene set was the set of genes found by ashq and no other methods. Following ashq, lfdr, adapt-glm, and fdrreg-t performed similarly. bl found almost the same number of rejections as qvalue, and ihw found slightly more than bh. fdrreg-e was as conservative as Bonferroni. The ranking of the methods based on the number of rejections was consistent across different strata of the covariate.
For the mir200c dataset, the ranking of the methods was very different compared to the GTEx dataset. Here, fdrreg-e found the most rejections by far, and the largest gene set was the set of genes found by fdrreg-e and no other methods. The next highest ranking methods were lfdr, ihw, and ashq, followed by bl, qvalue, bh, fdrreg-t, and adapt-glm, which all performed similarly. For this dataset, the ranking of methods changed substantially across strata of the covariate. For example, among the hypotheses falling between the 50th and 75th percentiles of the covariate, lfdr was ranked second (the next highest ranked method after fdrreg-e), but among the hypotheses between the 75th and 100th percentiles of the covariate, ashq was ranked second.
2.4 Case-study: Differential gene expression in single-cell RNA-seq

Over the past 5 years, breakthroughs in microfluidics and droplet-based RNA capture technologies have made it possible to sequence the transcriptome of individual cells rather than populations of cells. Quantification of single-cell RNA-seq (scRNA-seq) reads results in a gene-by-cell matrix of counts for each sample. The primary applications of scRNA-seq have been in describing cellular heterogeneity in primary tissues, differences in cellular heterogeneity in disease, and discovery of novel cell subpopulations. Differential gene expression analysis of scRNA-seq is used to determine gene sets that distinguish cell populations within the same biological condition, and between cell populations in different samples or conditions.
In this case-study, we examined differences in gene expression in two different biological systems. First, we detected differentially expressed genes between neoplastic glioblastoma cells sampled from a patient's tumor core with those sampled from nearby peripheral tissue [23]. In addition, we also detected differentially expressed genes between murine macrophage cells that were stimulated to produce an immune response with an unstimulated population [24]. We carried out differential expression analyses using two different methods developed for scRNAseq: scDD [25] and MAST [26], as well as the Wilcoxon Rank-Sum test.
We examined the mean nonzero expression and detection rate (defined as the proportion of cells expressing a given gene) as potentially informative covariates. For both datasets and all three differential expression methods, we found that mean nonzero expression and detection rate were both informative and approximately independent under the null hypothesis, satisfying the conditions for suitability of inclusion as an informative covariate for controlling FDR. All methods returned more rejections of genes with high nonzero mean and detection rate. They also tended to slightly favor genes with extremely low detection rate.
Across datasets, covariates, and differential expression tests, lfdr usually found the most rejections, followed by bl and adapt-glm. However, at smaller α values, adapt-glm was one of the most conservative methods. The ihw and qvalue methods were next, with their rank dependent on the dataset and differential expression test used. While a gain in rejections for ihw over bh was apparent in the human dataset, the performance of ihw was very similar to bh in the mouse dataset.

Case-study: Differential binding in ChIP-seq
ChIP-seq has been widely used to detect protein binding regions and histone modifications in DNA. Testing for differences in ChIP-seq signal between conditions usually involves two steps: first, defining sets of regions for which the ChIP-seq coverage is quantified; second, comparing the quantified coverage to test for statistically significant differential binding. In the first step, regions can be defined by peak calling from samples, based on their signal in sliding windows [27], or by a priori interest. In this study we benchmarked results from the latter two approaches by analyzing H3K4me3 data from two widely studied cell lines. Because H3K4me3 is an active marker of gene expression, its signal is most active in promoter regions. This allowed us to pursue an analysis of promoters as regions of interest. We also benchmark the results using the sliding window approach csaw to define de novo regions on the H3K4me3 dataset, as well as an additional dataset comparing CREB-binding protein (CBP) binding in wild-type versus knock-out mice. We exclude ashq and FDRreg methods from the sliding window analyses since csaw does not provide a standard error or test statistic.
Based on observations that differentially bound peaks tend to have higher coverage, we investigated the use of mean coverage as an informative covariate. In the promoter analysis, the p-value histograms showed that higher coverage groups were more enriched for significant p-values, suggesting mean coverage is an informative covariate. Likewise, we observed that wider windows in the de novo analysis tend to have more significant p-values. The distribution of p-values under the null in both cases appeared approximately uniform.
In the promoter analysis, ashq detected the highest number of differential binding regions by far, followed by lfdr, fdrreg-t, bl, qvalue, and adapt-glm, which all performed similarly. In the sliding window analyses, lfdr rejected the most hypotheses in both datasets, followed by adapt-glm, bl, and qvalue, which performed similarly to one another. In both datasets, the next lowest methods were ihw and bh, with the advantage of ihw only observed in the CBP csaw analysis. Finally, fdrreg-e was more conservative than Bonferroni in the promoter analysis.
2.6 Case-study: Differential abundance testing and correlations in microbiome data analysis

16S rRNA sequencing provides an overview of the microbial community in a given sample, and is a common and accessible way to identify relationships between microbial communities and phenotypes of interest. For example, differential abundance testing is often used to identify bacterial taxa which are enriched or depleted in a disease state, and non-parametric correlations between taxa abundances and phenotypes can be calculated when phenotypes of interest are continuous (e.g. body mass index). However, 16S rRNA datasets are high-dimensional, noisy, and sparse, and biological effects can be weak, complicating many statistical analyses and limiting power to detect true associations [28,29]. Furthermore, environmental samples tend to have many thousands of taxa, which further complicates our ability to identify significant associations.

We performed differential abundance tests on the OTU and genus levels for three different datasets from the microbiomeHD database: (1) obesity, where we do not expect a large disease-associated signal [28,30], (2) inflammatory bowel disease (IBD), which seems to have an intermediate number of disease-associated bacteria [31,32], and (3) infectious diarrhea (Clostridium difficile (CDI) and non-CDI), where the disease-associated signal is very strong [32,33]. We also performed Spearman correlation tests between OTU relative abundances and the respective values of three geochemical variables, measured from wells from a contaminated former S-3 waste disposal site in the Bear Creek watershed in Oak Ridge, Tennessee, part of the Department of Energy's Oak Ridge Field Research Center [34]. The geochemical variables were chosen based on their ability to be predicted by the microbial community in [34]: pH, Al, and SO 4 , where we expect strong, intermediate, and weak associations with the microbial abundances, respectively.
We examined the ubiquity (defined as the proportion of samples with non-zero abundance of each taxon) and the mean non-zero abundance of taxa as potentially informative covariates. We found that ubiquity was informative and approximately independent under the null hypothesis, satisfying the conditions for inclusion as an informative covariate for controlling FDR. Mean non-zero abundance appeared less informative than ubiquity, as it typically showed a less striking pattern in diagnostic plots of p-values by covariate value. OTU-level differential abundance analyses did not have sufficient power to detect any significant differences in the IBD, CRC, and obesity datasets. Similarly, no OTUs correlated with SO 4 levels, and ubiquity was not informative in this case. In addition, very few rejections were found in the CRC dataset at the genus level. Consequently, ubiquity was not informative in these "null" analyses and almost all FDR-correction methods found no significant associations. These "null" results are excluded from the results in the main text.
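Ubiquity as defined above is straightforward to compute from an OTU table; a minimal sketch (the samples x taxa layout is our assumption):

```python
import numpy as np

def ubiquity(otu_table):
    """Ubiquity of each taxon: the proportion of samples in which the
    taxon has non-zero abundance. `otu_table` is samples x taxa
    (counts or relative abundances)."""
    return (np.asarray(otu_table) > 0).mean(axis=0)

# 4 samples x 3 taxa: taxon 1 seen in 1 sample, taxon 2 in 3, taxon 3 in all
table = np.array([[0, 3, 1],
                  [2, 0, 4],
                  [0, 5, 2],
                  [0, 1, 7]])
print(ubiquity(table))  # proportions 0.25, 0.75, 1.0
```

The resulting per-taxon values can be passed directly as the covariate to methods such as ihw or bl.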
For the other analyses (genus-level differential abundance, OTU-level differential abundance in diarrhea, and OTU-level correlation analyses for pH and Al), ubiquity was informative and the FDR-correction methods which incorporated this information tended to recover more significant associations than naive methods. When there were enough tests for it to be applied, lfdr typically found the most rejections. This was usually followed by bl and qvalue, with the gain of bl over qvalue variable by dataset. In the correlation analyses, however, ihw found more rejections than bl and qvalue. The performance of adapt-glm was usually highly variable, both within a dataset and across datasets: it either had a very different ranking at different α levels (obesity), found among the most rejections (correlation with pH), or found no rejections at all (IBD, CRC). In cases with very few tests (e.g. genus-level analyses), ihw used only one covariate bin and reduced to bh as expected.