Assessing assumptions
ASH and FDRreg differ substantially from the remaining methods, and care should be taken to verify that the appropriate inputs are available and that the underlying assumptions are indeed valid. Based on these criteria, both methods were excluded from most case studies and several simulation settings considered in this benchmark. A more detailed discussion of these assumptions is included below. For FDRreg and adaptglm, the informative covariate must be specified in the form of a model matrix or formula (respectively). In both cases, we use the same type of formula or model matrix used in the authors’ original publications [18, 19]. For lfdr, the informative covariate must be a discrete group label. We follow the implementation of lfdr developed in [15], which automatically bins the input covariate into 20 approximately equally sized bins before estimating the withingroup local FDR (source code to implement this procedure available on the GitHub repository linked in “Availability of data and materials” section [35]).
Common to all modern FDRcontrolling procedures included in Fig. 1 is the requirement that the informative covariate also be independent of the pvalue or test statistic under the null. That is, while the covariate should be informative of whether a test is truly positive, if a test is truly negative, knowledge of the covariate should not alter the validity of the pvalue or test statistic. Violation of this condition can lead to loss of type I error control and an inflated rate of false discoveries [34]. To avoid these pitfalls, previously proposed visual diagnostics were used to check both the informativeness and independence of the selected covariate [15].
Implementation of benchmarked methods
All analyses were implemented using R version 3.5.0 [24]. We used version 0.99.2 of the R package SummarizedBenchmark [36] to carry out the benchmark comparisons, which is available on GitHub at the “fdrbenchmark” branch at https://github.com/areyesq89/SummarizedBenchmark/tree/fdrbenchmark
[37].
While other modern FDRcontrolling methods have also been proposed, methods were excluded if accompanying software was unavailable or if the available software could not be run without substantial work from the user [38].
BH
Adjusted p values by BH were obtained using the p.adjust function from the stats base R package, with option method=~BH~.
q value
Storey’s qvalues were obtained using the qvalue function in version 2.12.0 of the qvalue Bioconductor R package.
IHW
Adjusted p values by IHW were obtained using the adj _pvalues function on the output of the ihw function, both from version 1.8.0 of the IHW Bioconductor R package.
BL
Adjusted p values by BL were obtained by multiplying BH adjusted p values (see above) by the π_{0,i} estimates obtained using the lm _pi0 function from version 1.6.0 of the swfdr Bioconductor R package.
lfdr
Adjusted p values by lfdr were obtained by first binning the independent covariate into 20 approximately equalsized groups using the groups _by_filter function the IHW R package. Next, the fdrtool function from version 1.2.15 of the fdrtool CRAN R package was applied to the p values within each covariate bin separately, with the parameter statistic="pvalue". We require that at least 200 tests per bin, as recommended by fdrtool. Note that we follow [15] and use fdrtool rather than the locfdr package recommended by [17] to obtain local false discovery rates, as the former may operate directly on p values instead of requiring zscores as in the latter.
FDRreg
For applications where the test statistic was assumed to be normally distributed, Bayesian FDRs were obtained by the FDRreg function from version 0.21 of the FDRreg R package (obtained from GitHub at https://github.com/jgscott/FDRreg). The features parameter was specified as a model matrix with a Bspline polynomial spline basis of the independent covariate with 3 degrees of freedom (using the bs function from the splines base R package) and no intercept. The nulltype was set to "empirical" or "theoretical" for the empirical and theoretical null implementations of FDRreg, respectively.
ASH
q values were obtained using the get _qvalue function on the output of the ash function, both from version 2.27 of the ashr R CRAN package. The effect sizes and their corresponding standard errors were input as the effect _size and sebetahat parameters, respectively.
AdaPT
q values were obtained using the adapt _glm function version 1.0.0 of the adaptMT CRAN R package. The pi _formulas and mu _formulas arguments were both specified as natural cubic Bspline basis matrices of the independent covariate with degrees of freedom ∈{2,4,6,8,10} (using the ns function from the splines base R package).
Yeast in silico experiments
Preprocessed RNAseq count tables from [29] were downloaded from the authors’ GitHub repository at https://github.com/bartongroup/profDGE48. All samples that passed quality control in the original study were included. All genes with a mean count of at least 1 across all samples were included, for a total of 6553 genes. Null comparisons were constructed by randomly sampling 2 groups of 5 and 10 samples from the same condition (Snf2knockout). Nonnull comparisons of the same size were constructed by adding differentially expressed (DE) genes in silico to null comparisons. In addition to different sample sizes, several different settings of the proportion of nonnull genes, the distribution of the nonnull effect sizes and informativeness of the covariate were explored. An overview of the different settings is provided in Additional file 1 :Table S2.
We evaluated the results using a low proportion of nonnull genes (500, or approximately 7.5% nonnull) as well as a high proportion (2000 or approximately 30% nonnull). The nonnull genes were selected using probability weights sampled from a logistic function (where weights \(w(u)=\frac {1}{1+e^{10u+5}}\), and u∼U(0,1)). Three types of informative covariates were explored: (1) strongly informative, (2) weakly informative, and (3) uninformative. The strongly informative covariate X_{s} was equal to the logistic sampling weight w. The weakly informative covariate X_{w} was equal to the logisitic sampling weight plus noise: w+ε, where ε∼N(0,0.25), truncated such that X_{w}∈[0,1]. The uninformative X_{u} covariate was unrelated to the sampling weights and drawn from a uniform distribution such that X_{u}∼U(0,1).
We also evaluated the results under two different distributions of nonnull effect sizes: (1) unimodal and (2) imodal. For unimodal alternative effect size distributions, the observed fold changes for the selected nonnull genes in a nonnull empirical comparison of the same sample size were used. For bimodal alternatives, observed test statistics z from an empirical nonnull comparison of the same sample size were sampled with probability weights w(z)=f(x;α,β), where f is the Gamma probability density function (with shape and rate parameters α=4.5 and β=1−1e^{−4}, respectively). The corresponding effect sizes (fold changes, FC) for ashq were calculated assuming a fixed standard error: FC=zσ_{m}, where σ_{m} is the median standard error of the log_{2} fold change across all genes.
To add differential signal to the designated nonnull genes, the expression in one randomly selected group was then multiplied by their corresponding fold change. Differential expression analysis using DESeq2 [39] was carried out on both the null and nonnull comparisons to assess specificity and sensitivity of the FDR correction methods. Genes for which DESeq2 returned NAp values were removed. In each setting, simulations were repeated 100 times and the average and standard error are reported across replications. The results displayed in the main manuscript contain 2000 DE genes, use the strongly informative covariate, and have a sample size of 5 in each group. The results for all settings are presented in Additional files 2, 3, 4, and 5.
Polyester in silico experiments
The yeast RNAseq data described in the previous section was used to estimate the model parameters using version 1.16.0 of the polyester [28] R Bioconductor package. All samples that passed quality control in the original study were included. A baseline group containing all the samples in the wildtype group was used, and genes with mean expression of less than 1 count were filtered out. Counts were library size normalized using DESeq2 size factors [39], and the get_params function from the polyester package was used to obtain the model parameters. Counts were simulated using the create_read_numbers function. Using the same sample sizes as the yeast in silico experiments (5 or 10 samples in each group), we evaluated a null comparison, where the beta parameters of the create_read_numbers function (which represent effect size) were set to zero for all genes. We also evaluated nonnull comparisons where the beta parameters were drawn from a standard normal distribution for 2000 nonnull genes. The nonnull genes were selected in the same way as the yeast in silico experiments described in the previous section. Differential expression analysis and evaluation of FDR correction methods was also carried out as described for the yeast experiments. Results are presented in Additional file 6.
Simulation studies
We performed Monte Carlo simulation studies to assess the performance of the methods with known ground truth information. In each simulation, M observed effect sizes, \(\{d_{i}\}_{i=1}^{M}\), and standard errors, \(\{s_{i}\}_{i=1}^{M}\), were sampled to obtain test statistics, \(\{t_{i} = d_{i} / s_{i}\}_{i=1}^{M}\). Letting \(\{\delta _{i}\}_{i=1}^{M}\) denote the true effect sizes, each t_{i} was tested against the null hypothesis \(H_{0}^{i}: \delta _{i} = 0\). Observed effect sizes and standard errors were simulated to obtain test statistics following one of four distributions under the null:

Standard normal distribution

t distribution with 11 degrees of freedom

t distribution with 5 degrees of freedom

χ^{2} distribution with 4 degrees of freedom.
For each test i, let h_{i} denote the true status of the test, with h_{i}=0 and h_{i}=1 corresponding to the test being null and nonnull in the simulation. True effect sizes, δ_{i}, were set to 0 for {ih_{i}=0} and sampled from an underlying nonnull effect size distribution for {ih_{i}=1}. For normal and t distributed test statistics, observed effect sizes were simulated by adding standard normal noise to the true effect sizes such that d_{i}∼N(δ_{i},1). The standard errors were all set to 1 to obtain normal test statistics and set to \(s_{i} = \sqrt {v_{i} / \nu }\) with each \(v_{i}\sim \chi ^{2}_{\nu }\) independent to obtain t statistics with ν degrees of freedom. For χ^{2} test statistics, observed effect sizes were sampled from noncentral χ^{2} distributions with noncentrality parameters equal to the true effect sizes such that \(d_{i} \sim \chi ^{2}_{4}(ncp = \delta _{i})\). Standard errors were not used to simulate χ^{2} statistics and were simply set to 1. The pvalue was calculated as the twotail probability of the sampling distribution under the null for normal and t statistics. The uppertail probability under the null was used for χ^{2} statistics.
In all simulations, independent covariates, \(\{x_{i}\}_{i = 1}^{M}\), were simulated from the standard uniform distribution over the unit interval. In the uninformative simulation setting, the \(\{h_{i}\}_{i=1}^{M}\) were sampled from a Bernoulli distribution according to the marginal null proportion, \(\bar \pi _{0}\), independent of the {x_{i}}. In all other settings, the \(\{h_{i}\}_{i=1}^{M}\) were sampled from Bernoulli distributions with testspecific probabilities determined by the informative covariates through a function, p(x_{i}), taking values in [0,1]. Several forms of p(x_{i}) were considered in the simulations. The p(x_{i}) were chosen to explore a range of relationships between the covariate and the null probability of a test. For further flexibility, the functional relationships were defined conditional on the marginal null probability, \(\bar \pi _{0}\), so that similar relationships could be studied across a range of \(\bar \pi _{0}\). The following \(p(x_{i};\bar \pi _{0})\) relationships, shown in Additional file 1: Figure S5A for \(\bar \pi _{0} = 0.90\), were investigated in the simulations.
$$\begin{array}{*{20}l} p^{\text{cubic}}(x;\bar\pi_{0}) &= 4(1\pi_{0})(1x)^{1/3} + 4 \pi_{0}  3 \\ p^{\text{step}}(x;\bar\pi_{0}) &= \left\{ \begin{array}{ll} \bar\pi_{0}/2  1/2 & \text{if } x \in [0, 1/4) \\ \bar\pi_{0}/4  1/4 & \text{if } x \in [1/4, 1/2) \\ \bar\pi_{0}/4 + 1/4 & \text{if } x \in [1/2, 3/4) \\ \bar\pi_{0}/2 + 1/2 & \text{if } x \in [3/4, 1] \end{array}\right. \\ p^{\text{sine}}(x;\bar\pi_{0}) &= \left\{ \begin{array}{l} \bar\pi_{0}  \bar\pi_{0} \sin(2\pi \cdot x) \\ \ \ \text{if } \bar\pi_{0} \in [0, 1/2) \\ \bar\pi_{0} + (1\bar\pi_{0}) \sin(2\pi \cdot x) \\ \ \ \text{if } \bar\pi_{0} \in [1/2, 1] \end{array}\right. \\ p^{\text{cosine}}(x;\bar\pi_{0}) &= \left\{ \begin{array}{l} \bar\pi_{0}  \bar\pi_{0} \cos(2\pi \cdot x) \\ \ \ \ \text{if } \bar\pi_{0} \in [0, 1/2) \\ \bar\pi_{0} + (1\bar\pi_{0}) \cos(2\pi \cdot x) \\ \ \ \ \text{if } \bar\pi_{0} \in [1/2, 1] \end{array}\right. \end{array} $$
The functions p^{cubic} and p^{step} are valid, i.e., map to [0,1], for \(\bar \pi _{0}\in [3/4, 1]\) and \(\bar \pi _{0}\in [1/2,1]\), respectively. All other p(x_{i}) are valid for \(\bar \pi _{0}\in [0,1]\). In addition to these functional relationships, we also considered two specialized relationships with \(\bar \pi _{0}\) fixed at 0.80. These specialized relationships were parameterized by an informativeness parameter, δ∈[0,1], such that when δ=0, the covariate was completely uninformative and stratified the hypotheses more effectively as δ increased.
$$\begin{array}{*{20}l} p^{\text{cinfo}}(x;\delta) &= 0.8 \cdot (1  \delta) + \delta / (1+e^{5  25x}) \\ p^{\text{dinfo}}(x;\delta) &= \left\{ \begin{array}{ll} 0.2 \cdot (4 + \delta) & \text{if } x \in [0, 4/5] \\ 0.8 \cdot (1  \delta) & \text{if } x \in (4/5, 1] \end{array}\right. \end{array} $$
The first, p^{cinfo}, is a continuous relationship between the covariate, x and the null proportion, π_{0}, while the second, p^{dinfo}, is discrete. The results of simulations with p^{cinfo} are shown in Additional file 1: Figure S4AB, where the “informativeness” axis is simply 100·δ. The covariate relationships, p^{cinfo} and p^{dinfo}, are shown in Additional file 1: Figure S9 across a range of δ informativeness values.
Simulations were formulated and performed as several distinct case studies, with full results presented in Additional files 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, and 20. The complete combination of settings used in each simulation case study is given in Additional file 1: Table S3. In each case study, simulations were repeated 100 times, and performance metrics are reported as the average across the replications.
FDRreg and ASH were excluded from simulations with χ^{2} distributed test statistics because the setting clearly violated the assumptions of both methods (Fig. 1).
Case studies
We explored several real case studies using publicly available datasets. Unless otherwise stated, we use an α = 0.05 level to define a positive test. In all cases, we also evaluate the results using an independent but uninformative covariate (sampled from a standard uniform distribution). The locations of where the data can be found and the main details for each casestudy is described below. For more specific analysis details, please refer to Additional files 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, and 41 which contain complete reproducible code sets.
Genomewide association study
GWAS analysis results were downloaded from http://portals.broadinstitute.org/collaboration/giant/images/3/3a/BMI.SNPadjSMK.zip
[40, 41]. We used the results subset by European ancestry provided in the file BMI.SNPadjSMK.CombinedSexes.EuropeanOnly.txt to avoid the impact of population stratification on our results. We followed [42] and implemented a linkage disequilibrium (LD)based pruning step (using the clump command of PLINK v1.90b3s [43]) to remove SNPs in high LD (r^{2}<0.2) with any nearby SNP (< 250 Kb), based LD estimates from the 1000 Genomes phase three CEU population data [44], available at http://neurogenetics.qimrberghofer.edu.au/iSECA/1000G_20101123_v3_GIANT_chr1_23_minimacnamesifnotRS_CEU_MAF0.01.zip. We explored the use of both sample size and minor allele frequency for each SNP as informative covariates. For fdrrege and fdrregt, which require a normally distributed test statistic as input, the t statistic (effect size divided by standard error) was used. Because of the large sample sizes in this study (median 161,165), the t statistics were approximately normal. For ashq, we used the provided effect size and standard error of the test statistics. Full results are provided in Additional file 21.
Gene set analysis
We used two RNAseq datasets that investigated the changes in gene expression (1) between the cerebellum and cerebral cortex of 5 males from GenotypeTissue Expression Project [45] and (2) upon differentiation of hematopoietic stem cells (HSCs) into multipotent progenitors (MPP) [46] (processed data for both available at https://doi.org/10.5281/zenodo.1475409 [47]). For the independent and informative covariate, we considered the size of the gene sets. We considered two different gene set analysis methods: gene set enrichment analysis (GSEA) [48] and overrepresentation testing [49]. To implement the overrepresentation test, we first used version 1.20.0 of the DESeq2 R Bioconductor package to obtain a subset of differentially expressed genes (with adjusted pvalue < 0.05), on which a test of overrepresentation of DE genes among gene sets was performed using version 1.32.0 of the goseq R Bioconductor package [49]. To implement GSEA, we used version 1.6.0 of the fgsea R Bioconductor package [50] and used the DESeq2 test statistics to rank the genes. For both methods, Gene Ontology categories obtained using version 2.36.1 of the biomaRt R Bioconductor package containing at least 5 genes were used for the gene sets. For fgsea, 10,000 permutations were used and gene sets larger than 500 genes were excluded as recommended in the package documentation. The methods fdrrege, fdrregt, and ashq were excluded since they require test statistics and/or standard errors that GSEA does not provide. For goSeq, we also filtered on gene sets containing at least one DE gene. Full results are provided in Additional files 22, 23, 24, and 25.
Bulk RNAseq
We used two RNAseq datasets to asses the performance of modern FDR methods in the context of differential expression. The first dataset consisted of 20 paired samples of the GTEx project. These 20 samples belonged to 2 tissues (Nucleus accumbens and Putamen) of 10 female individuals. These samples were preprocessed as described in [51] (processed data available at https://doi.org/10.5281/zenodo.1475409 [47]). We used a second dataset from an experiment in which the microRNA mir200c was knocked down in mouse cells [52]. The transcriptomes of knockdown cells and control cells were sequenced in biological duplicates. The processed samples of the knockdown experiment were downloaded from the recount2 database available at http://duffel.rail.bio/recount/v2/SRP030475/rse_gene.Rdata [53, 54]. For each dataset, we tested for differential expression using DESeq2. For FDR methods that can use an informative covariate, we used mean expression across samples, as indicated in the DESeq2 vignette. Full results are provided in Additional files 26 and 27.
Singlecell RNAseq
We selected two datasets from the conquer [55] database available at http://imlspenticton.uzh.ch/robinson_lab/conquer/datamae/GSE84465.rds
and http://imlspenticton.uzh.ch/robinson_lab/conquer/datamae/GSE94383.rds [56]. First, we detected differentially expressed genes between glioblastoma cells sampled from a tumor core with those from nearby tissue (GSE84465) [57]. In addition, we detected DE genes between murine macrophage cells stimulated to produce an immune response with an unstimulated population (GSE94383) [58]. We filtered out cells with a mapping rate less than 20% or fewer than 5% of genes detected. Genes detected in at least 5% of cells were used in the analysis and spikein genes were excluded. We carried out DE analyses using two different methods developed for scRNAseq: scDD [59] and MAST [60], along with the Wilcoxon ranksum test. MAST was applied to log2(TPM + 1) counts using version 1.6.1 of the MAST R Bioconductor package. scDD was applied to raw counts normalized by version 1.8.2 of the scran R Bioconductor package [61] using version 1.4.0 of the scDD R Bioconductor package. Wilcoxon was applied to counts normalized by TMM (using version 3.22.3 of the edgeR R Bioconductor package [62]) using the wilcox.test function of the stats base R package. We examined the mean nonzero expression and detection rate (defined as the proportion of cells expressing a given gene) as potentially informative covariates. The fdrrege and fdrregt methods were excluded since none of the test statistics used are normally distributed. Likewise, ashq was excluded since none of the methods considered provide effect sizes and standard errors. Full results are provided in Additional files 28, 29, 30, 31, 32, and 33.
ChIPseq
ChIPseq analyses were carried out on two separate datasets. First, H3K4me3 data from two cell lines (GM12878 and K562) were downloaded from the ENCODE data portal [63] at http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeBroadHistone/ [64]. In each cell line, four replicates were selected with half of them from one laboratory and the other half from another laboratory. We performed two types of differential binding analyses between cell lines. First, following the workflow of [65], we used the csaw [66] method (version 1.14.1 of the csaw Bioconductor R package) to identify candidate de novo differential windows and applied the edgeR [62] method (version 3.22.3 of the edgeR R Bioconductor package) to test for significance. Second, we tested for differential binding only in predefined promoter regions using DESeq2 [39]. Promoter regions were obtained from the UCSC “Known Gene” annotations for human genome assembly hg19 (GRCh37). The csawbased analysis was also carried out on a second ChIPseq dataset comparing CREBbinding protein in wildtype and CREB knockout mice [65, 67] (GSE54453) obtained from the European Nucleotide Archive at https://www.ebi.ac.uk/ena/data/view/PRJNA236594
[68]. For the csawbased analyses, we used the region width as the informative covariate. For the promoter regionbased analysis, we used mean coverage as the informative covariate. The fdrrege and fdrregt methods were excluded from csaw analyses since the test statistics used are non normally distributed. Likewise, ashq was excluded since csaw does not provide effect sizes and standard errors. Full results are provided in Additional files 34, 35, and 36.
Microbiome
We performed two types of analyses (1) differential abundance analysis, and (2) correlation analysis. For the differential abundance analyses, we used four different datasets from the MicrobiomeHD database [69] available at https://doi.org/10.5281/zenodo.840333 [70]: (1) obesity [71], (2) inflammatory bowel disease (IBD) [72], (3) infectious diarrhea (clostridium difficile (CDI) and nonCDI) [73], and (4) colorectal cancer (CRC) [74]. These studies were processed as described in [69]. We performed Wilcoxon ranksum differential abundance tests on the operational taxonomic units (OTUs, sequences clustered at 100% genetic similarity) and on taxa collapsed to the genus level as in [69]. Full results are provided in Additional files 37, 38, 39, and 40.
For the correlation analyses, we used a previously published dataset of microbial samples from monitoring wells in a site contaminated by former waste disposal ponds, where all sampled wells have various geochemical and physical measurements [75]. Pairedend reads were merged using PEAR (version 0.9.10) and demultiplexed with QIIME v 1.9.1 split_libraries_fastq.py (maximum barcode error of 0 and quality score cutoff of 20) [76, 77]. Reads were dereplicated using USEARCH v 9.2.64 fastx_uniques, and operational taxonomic units (OTUs) were called with cluster_otus and an identity threshold of 0.97 [78]. These data were processed with the amplicon sequence analysis pipeline http://zhoulab5.rccc.ou.edu/pipelines/ASAP_web/pipeline_asap.php and are available at https://doi.org/10.5281/zenodo.1455793 [79]. We carried out a Spearman correlation test (H_{0}:ρ=0) between OTU relative abundances across wells and the respective values of three geochemical variables: pH, Al, and SO_{4}. Full results are provided in Additional file 41.
For all analyses, we examined the ubiquity (defined as the percent of samples with nonzero abundance of each taxa) and mean nonzero abundance of taxa as potentially informative covariates. The results in the main manuscript are shown for the OTU level, unless there were no rejections in the vast majority of methods, and then the results are shown for genus level. The SO_{4} dataset was also excluded from the main results since most methods find no rejections. We excluded fdrrege, fdrregt, and ashq since neither the Wilcoxon nor Spearman test statistics are normally distributed, nor do they provide an effect size and standard error. Due to small numbers of tests, lfdr was excluded from the obesity and IBD genus level analyses.
Evaluation metrics
All studies (in silico experiments, simulations, and case studies) were evaluated on the number of rejections at varying α levels, ranging from 0.01 to 0.10. The overlap among rejection sets for each method was examined using version 1.3.3 of the UpSetR R CRAN package. In silico experiments and simulation studies were also evaluated on the following metrics at varying α levels, ranging from 0.01 to 0.10: TPR, observed FDR, and true negative rate (TNR). Here, we define TPR as the number of true positives out of the total number of nonnull tests, observed FDR as the number of false discoveries out of the total number of discoveries (defined as 0 when there are no discoveries), and TNR as the number of true negatives out of the total number of null tests.
Summary metrics
The final ratings presented in Fig. 6 were determined from the aggregated results of the simulations, yeast experiments, and case studies.
FDR control
Ratings were determined using the results from nonnull simulations where all methods were applied, i.e., excluding χ^{2} settings, and all nonnull yeast experiments. In each setting or experiment, a method was determined to control FDR at the nominal 5% cutoff if the mean FDR across replications was less than one standard error above 5%. The following cutoffs were used to determine superior, satisfactory, and unsatisfactory methods.

Superior: failed to control FDR in less than 10% of settings in both simulations and yeast experiments.

Satisfactory: failed to control FDR in less than 10% of settings in either simulations or yeast experiments.

Unsatisfactory: otherwise.
The computed proportion of simulation and yeast settings exceeding the nominal FDR threshold are shown in Fig. 5a.
Power
Similar to the above, ratings were determined using the results from nonnull simulations where all methods were applied and all nonnull yeast experiments. In each setting or experiment, methods were ranked in descending order according to the mean TPR across replications at the nominal 5% FDR cutoff. Ties were set to the intermediate value, e.g., 1.5, if two methods tied for the highest TPR. The mean rank of each method was computed across simulation settings and yeast experiments separately and used to determine superior, satisfactory, and unsatisfactory methods according to the following cutoffs.

Superior: mean TPR rank less than 5 (of 8) in both simulations and yeast experiments.

Satisfactory: mean TPR rank less than 6 (of 8) in both simulations and yeast experiments.

Unsatisfactory: otherwise.
Mean TPR ranks for methods across simulation and yeast settings are shown in Fig. 5b.
Consistency
Ratings were determined using results from nonnull simulations where all methods were applied and all case studies. Here, ashq and fdrregt were excluded from metrics computed using the case studies, as the two methods were only applied in 4 of the 26 case studies. In each simulation setting, the TPR and FDR of each covariateaware method were compared against the TPR and FDR of both the BH approach and Storey’s qvalue at the nominal 5% FDR cutoff. Similarly, in each case study, the number of rejections of each method was compared against the number of rejections of BH and Storey’s qvalue. Based on these comparisons, two metrics were computed for each method.
First, in each setting and case study, a modern method was determined to underperform the classical approaches if the TPR or number of rejections was less than 95% of the minimum of the BH approach and Storey’s qvalue. The proportion of simulation settings and case studies where a modern method underperformed was used to determine the consistent stability of an approach (Fig. 5c). The FDR of the methods was not used for this metric.
Second, in each simulation setting and case study, the logratio FDR, TPR, and number of rejections were computed against a baseline for each modern method. For each setting, the average across the classical methods was used as the baseline. Methods were then ranked according to the standard deviation of these log ratios, capturing the consistency of the methods across simulations and case studies (Fig. 5d).
These two metrics were used to determine superior, satisfactory, and unsatisfactory methods according to the following cutoffs.

Superior: in top 50% (3) of methods according to variance of logratio metrics.

Satisfactory: not in top 50% (3) of methods according to variance of logratio metrics, but underperformed both BH and Storey’s qvalue in less than 10% of case studies or simulation settings.

Unsatisfactory: not in top 50% (3) of methods according to variance of logratio metrics, and underperformed both BH and Storey’s qvalue in more than 10% of case studies or simulation settings.
Applicability
Ratings were determined using the proportion of case studies in which each method could be applied. The proportion was calculated first within each type of case studies, followed by averaging across all case studies. This was done to prevent certain types, e.g., scRNAseq DE analysis, from dominating the average. The following cutoffs were used to determine superior, satisfactory, and unsatisfactory methods. For this metric, cases when adaptglm returned exactly zero positive tests while all other methods returned nonzero results where labeled as data sets where the method could not be applied. This is denoted by an asterisk in Fig. 6.

Superior: applied in 100% of case studies.

Satisfactory: applied in more than 50% of case studies.

Unsatisfactory: otherwise.
Usability
Ratings were determined based on our experience using the method for our benchmark comparison and rated according to the following criteria.

Superior: a welldocumented implementation is available.

Satisfactory: an implementation is available, but lacks extensive documentation or requires additional work to install.

Unsatisfactory: no implementation is readily available.
Open Access
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Additional files
Additional file 1 is available as supplementary material in Genome Biology and files 241 are available on GitHub at https://pkimes.github.io/benchmarkfdrhtml/ [30].