Assessment of statistical methods from single cell, bulk RNA-seq, and metagenomics applied to microbiome data

Calgaro, Matteo; Romualdi, Chiara; Waldron, Levi; Risso, Davide; Vitulo, Nicola

doi:10.1186/s13059-020-02104-1

Research
Open access
Published: 03 August 2020

Matteo Calgaro¹,
Chiara Romualdi²,
Levi Waldron³,
Davide Risso ORCID: orcid.org/0000-0001-8508-5012⁴^na1 &
…
Nicola Vitulo¹^na1

Genome Biology volume 21, Article number: 191 (2020)

Abstract

Background

The correct identification of differentially abundant microbial taxa between experimental conditions is a methodological and computational challenge. Recent work has produced methods to deal with the high sparsity and compositionality characteristic of microbiome data, but independent benchmarks comparing these to alternatives developed for RNA-seq data analysis are lacking.

Results

We compare methods developed for single-cell and bulk RNA-seq, and specifically for microbiome data, in terms of suitability of distributional assumptions, ability to control false discoveries, concordance, power, and correct identification of differentially abundant genera. We benchmark these methods using 100 manually curated datasets from 16S and whole metagenome shotgun sequencing.

Conclusions

The multivariate and compositional methods developed specifically for microbiome analysis did not outperform univariate methods developed for differential expression analysis of RNA-seq data. We recommend a careful exploratory data analysis prior to application of any inferential model and we present a framework to help scientists make an informed choice of analysis methods in a dataset-specific manner.

Background

Study of the microbiome, the uncultured collection of microbes present in most environments, is a novel application of high-throughput sequencing that shares certain similarities with, but also differs in important ways from, other applications of DNA and RNA sequencing. Common approaches for microbiome studies are based on deep sequencing of amplicons of universal marker genes, such as the 16S rRNA gene, or on whole metagenome shotgun sequencing (WMS). Community taxonomic composition can be estimated from microbiome data by assigning each read to the most plausible microbial lineage using an annotated reference database, with a higher taxonomic resolution in WMS than in 16S [1, 2]. The final output of such analyses usually consists of a large, highly sparse taxa-per-sample count table.

Differential abundance (DA) analysis is one of the primary approaches to identify differences in the microbial community composition between samples and to understand the structures of microbial communities and the associations between microbial compositions and the environment. DA analysis has commonly been performed using methods adapted from RNA sequencing (RNA-seq) analysis; however, characteristics specific to microbiome data make differential abundance analysis challenging. Compared to other high-throughput sequencing techniques such as RNA-seq, metagenomic data are sparse, i.e., the taxa count matrix contains many zeros. This sparsity can be explained by both biological and technical reasons: some taxa are very rare and present only in a few samples, while others are very lowly represented and cannot be detected because of an insufficient sequencing depth or other technical reasons.

In recent years, single-cell RNA-seq (scRNA-seq) has revolutionized the field of transcriptomics, providing new insight on the transcriptional program of individual cells, casting light on complex, heterogeneous tissues, and revealing rare cell populations with distinct gene expression profiles [3,4,5,6]. However, due to the relatively inefficient mRNA capture rate, scRNA-seq data are characterized by dropout events, which lead to an excess of zero read counts compared to bulk RNA-seq data [7, 8]. Thus, with the advent of this technology, new statistical models accounting for dropout events have been proposed. The similarities with respect to sparsity observed in both scRNA-seq and metagenomics data led us to ask whether statistical methods developed for the differential expression analysis of scRNA-seq data perform well on metagenomic DA analysis.

Some benchmarking efforts have compared the performance of methods [9,10,11,12] both adapted from bulk RNA-seq and developed for microbiome DA [13, 14]. While some tools exist to guide researchers [15], a general consensus on the best approach is still missing, especially regarding the methods’ capability of controlling false discoveries. In this study, we benchmark several statistical models and methods developed for metagenomics [13, 14, 16,17,18], bulk RNA-seq [19,20,21], and, for the first time, single-cell RNA-seq [7, 8, 22,23,24] on a collection of manually curated 16S and WMS [25, 26] real data as well as on a comprehensive set of simulations. We include in the comparison several tools that take into account the compositional nature of the data: they achieve this through the use of the Dirichlet-Multinomial Distribution (e.g., ALDEx2), Multinomial Distribution with reference frames (Songbird), or the Centered Log Ratio (CLR) transformation (e.g., ALDEx2, mixMC). The novelty of our benchmarking efforts is twofold. First, we include in the comparison novel methods recently developed in the scRNA-seq and metagenomics literatures; second, unlike previous efforts, our conclusions are based on several performance metrics on real data that range from type I error control and goodness of fit to replicability across datasets, concordance among methods, and enrichment for expected DA microbial taxa.

Results

We benchmarked a total of 18 approaches (Additional file 1: Supplementary Table 2) on 100 real datasets (Additional file 1: Supplementary Table 1), evaluating goodness of fit, type I error control, concordance, and power through (i) reliability of DA results in real data based on enrichment analysis and (ii) specificity and sensitivity using 28,800 simulated datasets (Fig. 1; Additional file 2: Supplementary Table 4).

The benchmarked methods include both DA methods specifically proposed in the metagenomics literature and methods proposed in the single-cell and bulk RNA-seq fields. The manually curated real datasets span a variety of body sites and characteristics (e.g., sequencing depth, alpha and beta diversity). The diversity of the data allowed us to test each method on a variety of circumstances, ranging from very sparse, very diverse datasets, to less sparse, less diverse ones.

We first analyzed 18 16S, 82 WMS, and 28 scRNA-seq public datasets to assess whether scRNA-seq and metagenomic data are comparable in terms of sparsity. The fractions of zero counts overlapped among the scRNA-seq, WMS, and 16S datasets, but scRNA-seq datasets were generally less sparse (12 to 75% zeros) than 16S (55 to 83%) and WMS datasets (35 to 89%); the distributions of zero frequencies of the latter two were not significantly different from each other (Wilcoxon test, W = 734, p = 0.377; Additional file 1: Supplementary Fig. S1a-b). To establish whether the difference between scRNA-seq and metagenomic data was due to the different number of features and samples, which are intrinsically related to sparsity, we explored the role of library size and experimental protocol (Additional file 1: Supplementary Fig. S1c). scRNA-seq datasets showed a marked difference in terms of the number of features and sparsity, as they are derived from different experimental protocols. Full-length data (e.g., Smart-seq) are on average sparser than droplet-based data (e.g., Drop-seq) but both are less sparse than 16S and WMS.
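As a minimal sketch of the sparsity measure discussed above, the fraction of zero counts in a taxa-per-sample table can be computed as follows. The function name and the toy table are illustrative assumptions and are not taken from the paper's R scripts.

```python
def zero_fraction(counts):
    """Return the fraction of entries equal to zero in a count table.

    `counts` is a list of rows (one per taxon), each a list of
    per-sample read counts.
    """
    total = sum(len(row) for row in counts)
    if total == 0:
        raise ValueError("empty count table")
    zeros = sum(1 for row in counts for value in row if value == 0)
    return zeros / total


# Hypothetical toy table: 3 taxa x 4 samples, 5 zeros out of 12 entries.
toy_counts = [
    [0, 5, 0, 12],
    [3, 0, 0, 0],
    [7, 9, 1, 4],
]
sparsity = zero_fraction(toy_counts)
```

Computing this value per dataset gives the quantity whose distributions are compared across 16S, WMS, and scRNA-seq data.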

These results indicate that metagenomic data are even more sparse than scRNA-seq, and thus that zero-inflated models designed for scRNA-seq could at least in principle have good performance in a metagenomic context.

Goodness of fit

As different methods rely on different statistical distributions to perform DA analysis, we started our benchmark by assessing the goodness of fit (GOF) of the statistical models underlying each method on the full set of 16S and WMS data. For each model, we evaluated its ability to correctly estimate the mean counts and the probability of observing a zero (Fig. 2). We evaluated five distributions: (1) the negative binomial (NB) used in edgeR [19] and DESeq2 [20], (2) the zero-inflated negative binomial (ZINB) used in ZINB-WaVE [23], (3) the truncated Gaussian Hurdle model of MAST [7], (4) the zero-inflated Gaussian (ZIG) mixture model of metagenomeSeq [13], and (5) the Dirichlet-Multinomial (DM) distribution underlying ALDEx2 [14]. The truncated Gaussian Hurdle model was evaluated following two data transformations, the default logarithm of the counts per million (logCPM) and the logarithm of the counts rescaled by the median library size (see the “Methods” section). Similarly, the ZIG distribution was evaluated considering the scaling factors rescaled either by one thousand (as implemented in the metagenomeSeq Bioconductor package) or by the median scaling factor (as suggested in the original paper). We assessed the goodness of fit for each of these models using the stool samples from the Human Microbiome Project (HMP) as a representative dataset (Fig. 2a–d); all other datasets gave similar results (Additional file 1: Supplementary Fig. S2). A useful feature of this dataset is that a subset of samples was processed both with 16S and WMS and hence can be used to compare the distributional differences of the two data types. Furthermore, this dataset includes only healthy subjects in a narrow age range, providing a good testing ground for covariate-free models.

The NB distribution showed the lowest root mean square error (RMSE, see the “Methods” section) for the mean count estimation, followed by the ZINB distribution (Fig. 2a, b). This was true for both 16S and WMS data, in most of the considered datasets (Additional file 1: Supplementary Fig. S2). Moreover, for both distributions, the difference between the estimated and observed means was symmetrically distributed around zero, indicating that the models did not systematically under- or overestimate the mean abundances (Fig. 2a, b; Additional file 1: Supplementary Fig. S2). Conversely, the ZIG distribution consistently underestimated the observed means, both for 16S and WMS and independently of the scaling factors (Fig. 2a, b). The Hurdle model was sensitive to the choice of the transformation: rescaling by the median library size rather than by one million reduced the RMSE in both 16S and WMS data (Fig. 2a, b). This was particularly evident in 16S data (Fig. 2a), in which the default logCPM values resulted in a substantial overestimation of the mean count, while the median library size scaling led to underestimation. Given the clear problems with logCPM, we only used the median library size for MAST and the median scaling factor for metagenomeSeq in all subsequent analyses. The DM distribution overestimated observed means for low-mean count features and underestimated observed values for high-mean count features. This overestimation effect was more evident in WMS than in 16S.

Concerning the ability of models to estimate the probability of observing a zero (referred to as zero probability difference, ZPD), we found that Hurdle models provided good estimates of the observed zero proportion for 16S (Fig. 2c) and WMS datasets (Fig. 2d). The NB and ZINB distributions, on the other hand, tended to overestimate the zero probability for features with a low observed proportion of zero counts in 16S (Fig. 2c). In WMS data, the ZINB distribution perfectly fitted the observed proportion of zeros, while the NB and DM models tended to underestimate it (Fig. 2d). Finally, the ZIG distribution always underestimated the observed proportion of zeros, especially for highly sparse features (Fig. 2c, d).
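The two GOF quantities discussed above can be illustrated with a short sketch. It assumes the common NB parameterization with mean `mu` and dispersion `phi` (variance `mu + phi * mu**2`) and a ZINB mixing weight `pi`; the sign convention (estimated minus observed) and all function names are assumptions for illustration, and the exact definitions used in the paper are given in its "Methods" section.

```python
import math


def nb_zero_prob(mu, phi):
    """P(Y = 0) for a negative binomial with mean mu and dispersion phi."""
    if phi == 0:
        # The NB reduces to a Poisson when the dispersion is zero.
        return math.exp(-mu)
    return (1.0 + phi * mu) ** (-1.0 / phi)


def zinb_zero_prob(mu, phi, pi):
    """P(Y = 0) for a zero-inflated NB with structural-zero weight pi."""
    return pi + (1.0 - pi) * nb_zero_prob(mu, phi)


def differences(estimated, observed):
    """Per-feature difference, assumed here to be estimated minus observed."""
    return [e - o for e, o in zip(estimated, observed)]


def rmse(diffs):
    """Root mean square error of a list of per-feature differences."""
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical example: three features with fitted NB parameters and
# observed zero proportions.
fitted = [(2.0, 0.5), (10.0, 1.0), (0.5, 2.0)]
observed_zero_fraction = [0.30, 0.10, 0.80]
estimated = [nb_zero_prob(mu, phi) for mu, phi in fitted]
zpd = differences(estimated, observed_zero_fraction)
zpd_rmse = rmse(zpd)
```

A positive difference for a feature means the model predicts more zeros than were observed; values scattered symmetrically around zero indicate no systematic bias.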

In summary, across all datasets, the best fitting distributions were the NB and ZINB: the NB distribution seemed to be particularly well-suited for 16S datasets, while the ZINB distribution seemed to better fit WMS data (Fig. 2e). We hypothesize that this is due to the different sequencing depths of the two platforms. In fact, while our 16S datasets have an average of 4891 reads per sample, in WMS, the mean depth is 3.6 × 10⁸ (3 × 10⁸ for HMP). To confirm this observation, we carried out a simulation experiment by down-sampling reads from deep-sequenced WMS samples (rarefaction): while the need for zero inflation seemed to diminish as we got closer to the number of reads typical of the corresponding 16S experiments, the profile did not completely match between approaches (Additional file 1: Supplementary Fig. S4b). This suggests that, while sequencing depth is an important contributing factor, it is not enough to completely explain the distributional differences between the two platforms.
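The down-sampling (rarefaction) experiment described above can be sketched for a single sample as drawing reads without replacement. This is an illustrative toy implementation under the assumption that reads are sampled uniformly at random; it expands every read in memory, so it is only practical for small examples and not for real WMS depths.

```python
import random


def rarefy(counts, depth, seed=0):
    """Down-sample one sample's taxon counts to `depth` reads.

    Reads are drawn uniformly at random without replacement, so the
    result follows a multivariate hypergeometric distribution.
    """
    total = sum(counts)
    if depth > total:
        raise ValueError("depth exceeds the number of available reads")
    rng = random.Random(seed)
    # Expand the counts into one taxon label per read.
    reads = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    kept = rng.sample(reads, depth)
    rarefied = [0] * len(counts)
    for taxon in kept:
        rarefied[taxon] += 1
    return rarefied


# Hypothetical deep sample with four taxa, rarefied to 40 reads.
deep_sample = [50, 0, 30, 20]
shallow_sample = rarefy(deep_sample, 40, seed=3)
```

Low-abundance taxa are the most likely to drop to zero after rarefaction, which is why sparsity and the need for zero inflation can change with sequencing depth.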

Type I error control

We next sought to evaluate the type I error rate control of each method, i.e., the probability that the statistical test calls a feature DA when it is not. To do so, we considered mock comparisons between the same biological Stool HMP samples (using the same Random Sample Identifier in both 16S and WMS), in which no true DA is present. Briefly, we randomly assigned each sample to one of two experimental groups and performed DA analysis between these groups, repeating the process 1000 times (see the “Methods” section for additional details). In this setting, the p values of a perfect test should be uniformly distributed between 0 and 1 (ref. [27]) and the false positive rate (FPR or observed α), which is the observed proportion of significant tests, should match the nominal value (e.g., α = 0.05).
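The mock-comparison idea can be sketched as follows. As a stand-in for the benchmarked DA methods, this example uses a Wilcoxon rank-sum test with a normal approximation, which is an assumption made only to keep the sketch self-contained; the function names, the default number of repeats, and the toy data are likewise hypothetical.

```python
import math
import random


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted((v, g) for g, grp in enumerate((x, y)) for v in grp)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    mean_w = n1 * (n + 1) / 2.0
    var_w = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w <= 0:
        return 1.0  # all values tied: no evidence of a difference
    z = (w - mean_w) / math.sqrt(var_w)
    return math.erfc(abs(z) / math.sqrt(2.0))


def mock_fpr(table, n_repeats=100, alpha=0.05, seed=1):
    """Observed FPR when samples are randomly split into two mock groups.

    `table` is a list of features, each a list of per-sample values
    from a single homogeneous condition (no true DA).
    """
    rng = random.Random(seed)
    n_samples = len(table[0])
    idx = list(range(n_samples))
    rejected = tested = 0
    for _ in range(n_repeats):
        rng.shuffle(idx)
        g1, g2 = idx[: n_samples // 2], idx[n_samples // 2:]
        for feature in table:
            p = rank_sum_pvalue([feature[i] for i in g1],
                                [feature[i] for i in g2])
            if p < alpha:
                rejected += 1
            tested += 1
    return rejected / tested
```

Because no feature is truly DA in a mock split, every rejection is a false positive, so the returned proportion can be compared directly with the nominal α.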

To evaluate the impact of both the normalization step and the estimation and testing step in bulk RNA-seq inspired methods, we included in the comparison both edgeR with its default normalization (TMM), as well as with DESeq2 recommended normalization (“poscounts,” i.e., the geometric mean of the positive counts) and vice versa (Table S2). Similarly, because the zinbwave observational weights can be used to apply several bulk RNA-seq methods to single-cell data [24], we have included in the comparison edgeR, DESeq2, and limma-voom with zinbwave weights.

The qq-plots and Kolmogorov-Smirnov (KS) statistics in Fig. 3 show that most methods achieved a p value distribution reasonably close to the expected uniform. Notable exceptions in the 16S experiment were edgeR with TMM normalization and robust dispersion estimation (edgeR_TMM_robustDisp), metagenomeSeq, and ALDEx2 (Fig. 3a, b). While the former two appeared to employ liberal tests, the latter was conservative in the range of p values that are typically of interest (0–0.1). In the WMS data, departure from uniformity was observed for metagenomeSeq, edgeR_TMM_robustDisp, and limma_voom_TMM_zinbwave, which employed liberal tests, as well as corncob_LRT, ALDEx2, and scde, which were conservative in the range of interest (Fig. 3c, d). We note that in the context of DA, liberal tests will lead to many false discoveries, while conservative tests will control the type I error at a cost of reduced power, potentially hindering true discoveries.
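For reference, the one-sample KS statistic against the uniform distribution on [0, 1] is the largest vertical distance between the empirical distribution of the p values and the diagonal. The sketch below is a generic textbook implementation with a hypothetical function name, not code from the paper.

```python
def ks_uniform_statistic(pvalues):
    """One-sample Kolmogorov-Smirnov statistic against Uniform(0, 1)."""
    n = len(pvalues)
    if n == 0:
        raise ValueError("no p values supplied")
    ordered = sorted(pvalues)
    d = 0.0
    for i, p in enumerate(ordered, start=1):
        # Distance just after and just before the i-th jump of the ECDF.
        d = max(d, i / n - p, p - (i - 1) / n)
    return d


# Hypothetical p values: one roughly uniform set, one piled up near zero.
roughly_uniform = [0.25, 0.75]
liberal = [0.01, 0.02, 0.03, 0.04]
d_uniform = ks_uniform_statistic(roughly_uniform)
d_liberal = ks_uniform_statistic(liberal)
```

A statistic near zero indicates p values close to uniform, whereas p values concentrated near zero (a liberal test) or near one (a conservative test) both inflate it.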

We next recorded the FPR of each method (by definition all discoveries are false positives in this experiment) and compared it to its expected nominal value. This analysis confirmed the tendencies observed in Fig. 3a, b and c, d. In particular, edgeR_TMM_robustDisp and metagenomeSeq were very liberal in both 16S (Fig. 3e) and WMS data (Fig. 3f); in the case of metagenomeSeq, as much as 30% of the features were deemed DA in the 16S datasets when claiming a nominal FPR of 5% (Fig. 3e). ALDEx2, scde, and MAST, albeit conservative, were able to control type I error. In between these two extremes, edgeR, DESeq2, and limma showed an observed FPR slightly higher than the nominal value. In particular, DESeq2-based methods, limma-voom, and MAST were very close to the nominal FPR for 16S (Fig. 3e), while limma-voom, MAST, and corncob (with Wald test) were the closest in WMS data (Fig. 3f). Of note, corncob seemed slightly conservative in WMS data and slightly liberal in 16S data, with LRT being closer than Wald to the nominal value in 16S (Fig. 3e) and vice versa in WMS data (Fig. 3f). The zinbwave weights showed mixed results: DESeq2 with zinbwave weights was better than the unweighted versions in WMS, while the weights did not help edgeR and limma in controlling the type I error rate. Taken together, these results suggest that the majority of the methods do not control the type I error rate, both in 16S and WMS data, confirming previous findings [10, 12]. However, for most approaches, the observed FPR is only slightly higher than its nominal value, making the practical impact of this result unclear.

Between-method concordance

We measured the ability of each method to produce replicable results in independent data in six datasets [25, 26, 28,29,30] (Additional file 1: Supplementary Table S3) that showed different alpha and beta diversity, as well as different amounts of DA between two experimental conditions (Additional file 1: Supplementary Fig. S5). Each dataset was randomly split in two equally sized subsets and each method was separately applied to each subset. The process was repeated 100 times (see the “Methods” section for details). To assess the ability of methods to return concordant results from independent samples, we employed the Concordance At the Top (CAT) measure [31] to assess between-method concordance (BMC) by comparing the list of DA features across methods in the subset (ranked by p value when available or by importance in the case of songbird and mixMC; see Methods). We used BMC to (i) group methods based on their degree of agreement and (ii) identify those methods sharing the largest amount of discoveries with the majority of the other methods. Although concordance is not a guarantee of validity, it is a requirement of validity, so methods sharing the largest amount of discoveries with the majority of other methods may be more likely to also be producing valid results.
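The CAT measure can be sketched in a few lines: for each rank i, it is the proportion of features shared by the top i entries of two ranked lists. The function name and taxon labels below are hypothetical, and how the resulting curve is summarized in the paper is described in its "Methods" section.

```python
def concordance_at_the_top(ranking_a, ranking_b, max_rank=None):
    """Return the CAT curve for two ranked feature lists.

    Element i-1 of the result is the fraction of features shared by
    the top i entries of `ranking_a` and the top i of `ranking_b`.
    """
    if max_rank is None:
        max_rank = min(len(ranking_a), len(ranking_b))
    curve = []
    for i in range(1, max_rank + 1):
        shared = set(ranking_a[:i]) & set(ranking_b[:i])
        curve.append(len(shared) / i)
    return curve


# Hypothetical rankings of four taxa produced by two methods
# (or by one method on two random subsets).
method_1 = ["taxon1", "taxon2", "taxon3", "taxon4"]
method_2 = ["taxon2", "taxon1", "taxon4", "taxon3"]
cat_curve = concordance_at_the_top(method_1, method_2)
```

The same function covers both uses in this section: comparing two different methods on the same subset, or comparing one method against itself on the two random subsets.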

Concordance analysis performed on 16S Tongue Dorsum vs. Stool dataset (Fig. 4a) showed that the methods clustered within two distinct groups: the first comprising all methods that include a TMM normalization step, songbird, and scde, the second containing all the other approaches (Fig. 4a). Even within the second group, methods segregated by normalization, as can be seen by the tight clustering of all the methods that include a poscount normalization step (Fig. 4a). This indicates that, in 16S data, the choice of the normalization has a pronounced effect on inferential results, even more so than the choice of the statistical test. A similar result was previously observed in bulk RNA-seq data [32]. The use of observational weights to account for zero inflation did not seem to matter in these data, and in general, scRNA-seq methods did not agree with each other (Fig. 4a). Similarly, the clustering did not separate compositional and non-compositional methods (Fig. 4a). We noted that metagenomeSeq was not concordant with any other method and that the two corncob approaches formed a tight group, confirming that modeling strategies have more impact than the choice of the test statistics in these data.

A different picture emerged from the analysis of the WMS data (Fig. 4b). Here, methods are clustered by the testing approach. The bottom cluster comprised the bulk RNA-seq methods with the inclusion of the Wilcoxon nonparametric approach, metagenomeSeq, and mixMC. The middle cluster consisted of the zinbwave methods and ALDEx2. The top cluster comprised MAST, corncob, scde, and songbird. Overall, mixMC and the methods based on NB generalized linear models showed the highest BMC values. When observational weights were added to those models, the BMC decreased, but still a good level of concordance was observed with their respective unweighted version.

We noted that the BMC is highly dataset-specific and depends on the amount of DA between the compared groups. Indeed, BMC decreased with decreased beta diversity of the dataset, and the role of normalization became less clear (Additional file 1: Supplementary Fig. S6).

Within-method concordance

The CAT metric was used again for assessing the within-method concordance (WMC), i.e., the amount of concordance of the results of each method on the two random subsets.

WMC was clearly dataset-dependent, showing high levels of concordance in datasets with a high differential signal (e.g., tongue vs. stool, Fig. 5a) and low concordance in datasets with a low differential signal (e.g., supragingival vs. subgingival, Fig. 5e). Overall, the replicability of results in WMS studies was slightly higher than that of 16S datasets.

In terms of method comparison, corncob showed high levels of concordance in WMS datasets but lower concordance in all 16S datasets (Fig. 5). Similarly, songbird showed the highest concordance in mid (Fig. 5d) and low (Fig. 5f) diversity WMS datasets but did not perform well in 16S (especially for the highly diverse TongueDorsum vs. Stool comparison; Fig. 5a).

The addition of zinbwave weights to edgeR, DESeq2, and limma-voom did not always help: it was sometimes detrimental, e.g., for edgeR in the schizophrenia dataset (Fig. 5d) and sometimes led to an improvement in replicability, e.g., for limma-voom in the Tongue Dorsum vs. Stool dataset (Fig. 5a). The schizophrenia dataset had the lowest sample size among all the datasets evaluated, suggesting that sample size may play an important role in estimating zinbwave weights.

While this analysis confirmed the unsatisfactory performance of metagenomeSeq (Fig. 5a, b, and f), ALDEx2, which was very conservative in terms of type I error control (Fig. 3), showed overall good performance, with the notable exception of the high-diversity WMS dataset (Fig. 5b), for which it was the worst performing method. To sum up, the highest concordance in all WMS datasets was achieved by the corncob-based and songbird methods, while RNA-seq methods performed better in 16S datasets, confirming that the two platforms yield substantially different data. mixMC was the only method that never showed poor concordance regardless of the technology or of the diversity of the compared groups.

Taken together, these analyses suggest that both BMC and WMC are highly dependent on the amount of DA observed in the dataset: higher DA leads to a higher concordance. Moreover, WMC was similar among the compared methods, indicating that the replicability of the DA results depends more on the strength of DA than on the choice of the method (Fig. 5).

Enrichment analysis

While mock comparisons and random splits allowed us to evaluate model fit and concordance, these analyses do not assess the correctness of the discoveries. Even the method with the highest WMC could nonetheless consistently identify false positive DA taxa.

While the lack of ground truth makes it challenging to assess the validity of DA results in real data, enrichment analysis [33] can provide an alternative solution to rank methods in terms of their ability to identify as significant taxa that are known to be differentially abundant between two groups.

Here, we leveraged the peculiar environment of the gingival site: the supragingival biofilm is directly exposed to the open atmosphere of the oral cavity, favoring the growth of aerobic species. In the subgingival biofilm, however, the atmospheric conditions gradually become strict anaerobic, favoring the growth of anaerobic species [34]. From the comparison of the two sites, we thus expected to find an abundance of aerobic microbes in the supragingival plaque and of anaerobic bacteria in the subgingival plaque. DA analysis should reflect this difference by finding an enrichment of aerobic (anaerobic) bacteria among the DA taxa with a positive (negative) log-fold-change.

We tested this hypothesis by comparing 38 16S supragingival and subgingival samples (for a total of 76 samples) from the HMP (see the “Methods” section for details). The DA methods showed a wide range of power, identifying from 2 (ALDEx2) to 305 (metagenomeSeq) significantly DA taxa (Fig. 6a). However, almost all methods correctly found an enrichment of anaerobic microbes among the taxa under-abundant in supragingival and an enrichment of aerobic microbes among the over-abundant ones (Fig. 6a; Additional file 1: Supplementary Fig. S7). Furthermore, as expected, no enrichment was found for facultative anaerobic microbes, which are able to switch between aerobic and anaerobic respiration (Fig. 6a).
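One common way to test for such an enrichment is a one-sided hypergeometric (Fisher exact) test on the overlap between a set of DA taxa and a metabolism class. The sketch below assumes this test purely for illustration; the test actually used in the paper is specified in its "Methods" section, and the argument names and counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_da_in_class, n_da, n_class, n_total):
    """One-sided hypergeometric p value for over-representation.

    Probability of seeing at least `n_da_in_class` members of a class
    of size `n_class` among `n_da` DA taxa drawn from `n_total` taxa.
    """
    upper = min(n_da, n_class)
    denominator = comb(n_total, n_da)
    numerator = sum(
        comb(n_class, k) * comb(n_total - n_class, n_da - k)
        for k in range(n_in_range_start(n_da_in_class), upper + 1)
    )
    return numerator / denominator


def n_in_range_start(observed):
    """Clamp the observed overlap to be non-negative."""
    return max(0, observed)


# Hypothetical counts: 10 taxa in total, 5 anaerobic, 5 under-abundant
# DA taxa, all 5 of which are anaerobic.
p_enriched = enrichment_pvalue(5, 5, 5, 10)
```

A small p value indicates that the class (for example, anaerobic taxa) is over-represented among the DA taxa with the given direction of change.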

Although most methods performed well, scde, ALDEx2, and MAST had too low power to detect any enrichment (at 0.05 significance level), as their number of identified DA taxa was very low (Fig. 6a). This analysis confirmed the conservative behavior of these methods in 16S data (Fig. 3e). Finally, metagenomeSeq and edgeR with robust dispersion estimation found the correct enrichments, but they also identified many anaerobic taxa with a positive log-fold-change (Fig. 6a), confirming their liberal tendencies (Fig. 3e). Overall, these results were confirmed by the same comparison in WMS data (Additional file 1: Supplementary Fig. S8), but the reduced sample size of our WMS dataset resulted in a reduced power to detect DA for all methods (see the “Methods” section).

To explore the ability of each method to correctly rank the DA taxa independently of its power, we tested whether over-abundant aerobic taxa and under-abundant anaerobic taxa were more likely to be ranked at the top when ranking taxa by each method’s test statistics. To do so, we considered the top K taxa (with K from 1 to 20%; see the “Methods” section) and computed the difference between putative true positives (TP; over-abundant aerobic taxa and under-abundant anaerobic taxa) and putative false positives (FP; under-abundant aerobic taxa and over-abundant anaerobic taxa; Fig. 6b). Reassuringly, increasing the threshold resulted in a larger difference between TP and FP for most methods (Fig. 6b), indicating that independently of their power, most methods are able to rank true positive taxa highly. This becomes particularly important for the methods with a low power, suggesting that in these cases a more liberal p value threshold may be applied. However, metagenomeSeq’s performance deteriorates after the 10% threshold, suggesting that this method starts to identify more false positives (Fig. 6b): this is particularly problematic since its adjusted p value threshold identifies 34% of the taxa as DA. Among the other methods, MAST and ALDEx2 showed a consistently lower performance, while limma-voom was the best performer at permissive thresholds, and songbird was the best performer at strict thresholds (Fig. 6b).
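The TP minus FP score for the top K taxa can be sketched as follows. The labels ("up" for over-abundant in supragingival plaque, "down" for under-abundant), the integer-percent threshold, and the toy ranking are assumptions for illustration; the exact procedure is described in the paper's "Methods" section.

```python
def tp_minus_fp(ranked_taxa, top_percent):
    """Difference between putative true and false positives in the top K.

    `ranked_taxa` is a list of (direction, metabolism) pairs ordered
    from most to least significant; `top_percent` is an integer
    percentage of the list to inspect. Facultative or unannotated taxa
    count as neither TP nor FP.
    """
    k = max(1, len(ranked_taxa) * top_percent // 100)
    true_pos = {("up", "aerobic"), ("down", "anaerobic")}
    false_pos = {("down", "aerobic"), ("up", "anaerobic")}
    tp = sum(1 for taxon in ranked_taxa[:k] if taxon in true_pos)
    fp = sum(1 for taxon in ranked_taxa[:k] if taxon in false_pos)
    return tp - fp


# Hypothetical ranking of ten taxa by one method's test statistic.
ranking = [
    ("up", "aerobic"), ("down", "anaerobic"), ("up", "anaerobic"),
    ("down", "facultative"), ("up", "aerobic"), ("down", "aerobic"),
    ("down", "anaerobic"), ("up", "facultative"), ("up", "anaerobic"),
    ("down", "anaerobic"),
]
score_top_half = tp_minus_fp(ranking, 50)
```

Evaluating the score over a grid of thresholds gives a curve per method; a curve that keeps rising with K indicates that additional top-ranked taxa are mostly putative true positives.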

The majority of aerobic taxa were found DA by just a handful of methods, with only 15 out of 75 unique aerobic taxa identified as DA by 3 or more representative methods (see Methods; Fig. 6c). All of them belonged to the genera Cardiobacterium, Neisseria, Lautropia, and Corynebacterium, found to be among the most prevalent genera in supragingival plaques in an independent study [35]. On the other hand, 57 out of 161 unique anaerobic taxa were found DA by 5 or more representative methods (see Methods; Fig. 6d; Additional file 1: Supplementary Fig. S9). Among these, Fusobacterium, Prevotella, Porphyromonas, and Treponema are known to be abundant in the subgingival plaque [36, 37]. Despite the small sample size for WMS data (n = 10), enrichment and DA analysis were largely consistent, including several strains of Neisseria and several species of Treponema found to be DA (Additional file 1: Supplementary Fig. S8c,d). Overall, similar methods tended to identify a higher number of mutual taxa, confirming our previous findings in the concordance analysis (Additional file 1: Supplementary Fig. S6) and highlighting how different statistical tests and normalization approaches have a big impact on the identified DA taxa.

Parametric simulations

Given the results of our GOF analysis (Fig. 2), we only used the NB and ZINB distributions to simulate 7200 and 21,600 scenarios, respectively (28,800 in total), mimicking both 16S and WMS data. The simulated data differed in sample size, proportion of DA features, effect size, proportion of zeros, and whether there was an interaction between the amount of zeros and DA (sparsity effect, see the “Methods” section for details).

In general, the results confirmed our expectation that methods perform well on simulated data that conform to their assumptions (Additional file 2: Supplementary Fig. S11). The parametric distribution that generated the data had a great influence on method performance, and the methods that rely on NB and ZINB generally performed better than the other methods. As an example, MAST, which showed overall good results in real data, did not perform well in simulations, partly because its model is misspecified with respect to the data-generating distribution.

As expected, all methods’ performances increased as the sample size and/or the effect size increased. Confirming our real data results, we finally observed that metagenomeSeq, scde, and edgeR-robust performed poorly. Details on the simulated data analysis can be found in Additional file 2.

Discussion

We investigated different theoretical and practical issues related to the analysis of metagenomic data. The main objective of the study was to compare several DA detection methods adapted from bulk RNA-seq, single-cell RNA-seq, or specifically developed for metagenomics. Unsurprisingly, there is no single method that outperforms all others in all the tested scenarios. As is often the case in high-throughput biology, the results are data-dependent and careful data exploration is needed to make an informed decision on which workflow to apply to a specific dataset. We recommend applying our exploratory analysis framework to gain useful insights about the assumptions of each method and their suitability given the data at hand. To this end, we provide all the R scripts to easily reproduce the analyses of this paper on any given dataset (see the "Availability of data and materials" section).

Our GOF analysis highlighted the advantages of using count models for the analysis of metagenomics data. The goodness of fit of zero-inflated models seemed dependent on whether the data come from 16S or WMS experiments. The difference between these two approaches translates to different count data structures: while for WMS many features are characterized by a clearly visible bimodal distribution (with a point mass at zero and another mass, quite far from zero, at the second positive mode), 16S data are as sparse as or even more sparse than WMS data, presenting for many features a less clearly bimodal distribution (Additional file 1: Supplementary Fig. S4a). This difference is probably due to a mix of factors: primarily sequencing depth, but also different taxonomic classification between technologies (entire metagenomic sequences versus clusters of similar amplicon sequences), bioinformatics methods for data preprocessing, etc. However, comparing the distribution of several genera on the same samples assayed with 16S and WMS, we observed that many of the zero counts were consistent across platforms and very different read depths, suggesting that many observed zeros are biological and not technical in nature (Additional file 1: Supplementary Fig. S4a). Further analyses are needed to inspect this unsolved issue and related efforts are ongoing in the single-cell RNA-seq literature, where similar differences are observed between protocols with and without unique molecular identifiers [38, 39].

Metagenomic data are inherently compositional, but whether incorporating compositionality into the statistical model provides benefits greater than the tradeoffs it may introduce is a debated topic in the literature [9, 13, 40, 41, 42]. While other data resulting from sequencing are also compositional, some in the microbiome data analysis community believe that compositionality has greater relevance in metagenomics due to the potential presence of dominant microbes. Here, we found that compositional methods did not outperform non-compositional methods designed for count data, indicating that their benefits did not outweigh the drawbacks they may introduce. This can be explained by two considerations. First, some compositional methods assume that the data arise from a multinomial distribution, with n trials (reads) and a vector p giving the probability that a read maps to each taxon. In metagenomic studies, n (the number of sequenced reads) is large and each entry of p is small (since there are many taxa, the probability that a read maps to any given taxon is small). In this setting, the Poisson distribution is a good approximation of the multinomial. Similarly, the negative binomial is a good approximation of the Dirichlet-Multinomial [31]. Second, some normalizations, such as the geometric mean method implemented in DESeq2 or the trimmed mean of M-values of edgeR, have size factors mathematically equivalent or very similar to the centered log-ratio proposed by Aitchison [40, 43]. This has been shown to reduce the impact of compositionality on DA results [44]. We did not test the ANCOM package [45] because it was too slow for assessment. However, we included three recent analysis methods that address compositionality, namely, ALDEx2, songbird, and mixMC. This allowed us to perform an adequate assessment of compositional vs. non-compositional approaches.
Similarly, multivariate methods, such as songbird and mixMC, did not outperform methods based on univariate tests, suggesting that these simpler approaches are often sufficient to detect the most relevant biological signals.
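
The Poisson approximation invoked above can be checked numerically. Under a multinomial model, the count of a single taxon is marginally binomial with n trials and success probability p_i. The following is a minimal sketch that compares this binomial with a Poisson of the same mean; the values of n_reads and p_taxon are hypothetical and chosen only for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed on the log scale for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed on the log scale."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

# Hypothetical values: a deep library and a rare taxon.
n_reads = 100_000
p_taxon = 5e-5

# Largest pointwise gap between the two probability mass functions for k = 0..30.
max_abs_diff = max(
    abs(binomial_pmf(k, n_reads, p_taxon) - poisson_pmf(k, n_reads * p_taxon))
    for k in range(31)
)
```

With large n and small p the two distributions are nearly indistinguishable, which is consistent with count models and multinomial-based models behaving similarly in this regime.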

The lack of ground truth makes the assessment of DA correctness very challenging. However, we can rely on mock datasets, within-method concordance, and enrichment analysis to obtain a principled ranking of method performance (Fig. 7). Although each analysis by itself does not imply correctness, taken together these assessments are a good proxy to evaluate method performance in terms of the ability to limit the number of false discoveries, give replicable results in datasets contrasting the same groups, and identify as significant the taxa that are expected to be DA.

The parametric simulation framework is useful to inspect how individual characteristics of the data-generating distribution impact the sensitivity and specificity of the methods. As the entire analysis was supported by real data, we decided to focus only on a very simple but easily reproducible implementation of the NB and ZINB distributions for the simulations. The choice was justified by our GOF analysis on real datasets. Unsurprisingly, the sample size and the effect size were the characteristics that had the most impact on method performances. This translates into an evident suggestion for experimental design: large sample sizes are needed to detect low effect sizes. Our simulation framework can in principle be used for power calculations in the context of DA analysis.

In the 16S dataset used for the enrichment analysis, with a total of 76 samples and almost 900 unique taxa, the most time-consuming methods were scde and songbird, each needing more than 5 min to identify DA taxa. ALDEx2 and corncob-based methods took about 40 s, zinbwave-weighted methods took approximately 20 s, and mixMC, MAST, and seurat_wilcoxon took around 10 s. DESeq2 and edgeR took less than 10 s, and limma-voom was the fastest method, taking less than a second (Fig. 7). A consistent ranking was found in simulated datasets, with interesting changes determined by different sample sizes (Additional file 2: Supplementary Table S5 and Supplementary Fig. S10).

Conclusions

As already noted in recent publications [10,11,12], the perfect method does not exist. However, taken together, our analyses suggested that limma-voom, corncob, and DESeq2 showed the most consistent performance across all datasets, metagenomeSeq had the worst performance, and scde and ALDEx2 suffered from low power (Fig. 7). Among compositional data analysis methods, songbird showed a greater ability to identify the correct taxa in the enrichment analysis, while mixMC had a better within-method concordance.

In general, we recommend a careful exploratory data analysis and we present a framework that can help scientists make an informed choice in a dataset-specific manner. We did not find evidence that bespoke differential abundance methods outperform methods developed for the differential expression analysis of RNA-seq data. However, our analyses also suggested that further research is required to overcome the limitations of currently available methods: in this respect, new directions in DA method development, e.g., leveraging the phylogenetic tree [46, 47], log-contrast models [48], or compositional balances [49] are promising, but efforts to make these methods scalable are needed.

Methods

Datasets

The HMP16SData [25] (v1.2.0) and curatedMetagenomicData [26] (v1.12.3) Bioconductor packages were used to download high-quality, uniformly processed, and manually annotated human microbiome profiles for thousands of people, using 16S and whole metagenome shotgun sequencing technologies, respectively. HMP16SData comprises the collection of 16S data from the Human Microbiome Project (HMP), while curatedMetagenomicData contains data from several projects. Gene-level counts for a collection of public scRNA-seq datasets were downloaded from the scRNAseq (v1.99.8) Bioconductor package.

While the latter datasets are used only for a comparison between technologies, the former are widely used for all the analyses. A complete index with dataset usage is reported in Additional file 1: Supplementary Table S1.

Phyloseq objects were obtained from the HMP16SData and curatedMetagenomicData packages using the function as_phyloseq() and setting the bugs.as.phyloseq = TRUE argument, respectively. The otu_table and sample_data slots of the phyloseq objects that contain, respectively, the taxa count table and the metadata associated with each sample were used for all downstream analyses. For the WMS datasets, absolute raw count data were estimated from the MetaPhlAn2-produced relative count data by multiplying the columns of the ExpressionSet data by the number of reads for each sample, as found in the pData column “number_reads” (counts = TRUE argument).
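
The count estimation for WMS data described above can be sketched as follows. This is an illustrative sketch, not the code of curatedMetagenomicData: it assumes that relative abundances are expressed as percentages and that the products are rounded to the nearest integer, and the function name estimate_counts is hypothetical.

```python
def estimate_counts(relative_abundance_pct, number_reads):
    """Estimate a taxa-by-sample count table from relative abundances.

    relative_abundance_pct: rows are taxa, columns are samples, values are
        percentages (assumption: each column sums to about 100).
    number_reads: sequencing depth of each sample, one value per column.
    """
    return [
        [round(value / 100.0 * number_reads[j]) for j, value in enumerate(row)]
        for row in relative_abundance_pct
    ]

# Hypothetical example with three taxa and two samples.
relative_abundance = [
    [50.0, 10.0],
    [30.0, 0.0],
    [20.0, 90.0],
]
depths = [1000, 2000]
estimated = estimate_counts(relative_abundance, depths)
```

Note that zeros in the relative abundance table remain zeros in the estimated counts, so the sparsity of the table is unchanged by this step.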

HMP16SData was split by body subsite in order to obtain 18 separate datasets. Stool and Tongue Dorsum datasets were selected for example purposes thanks to their high sample size. The same was done on the curatedMetagenomicData HMP dataset, obtaining 9 datasets. Moreover, for the evaluation of type I error control, 41 stool samples with equal RSID, in both 16S and WMS, were used to compare DA methods. For each research project, curatedMetagenomicData was split by body site and treatment or disease condition, in order to create homogeneous sample datasets. A total of 82 WMS datasets were created.

A total of 100 datasets were evaluated; however, for the CAT analysis, datasets not split by condition or body subsite were evaluated (e.g., Tongue Dorsum vs. Stool in HMP, 2012 for both 16S and WMS).

To account for the complexity and variety of experimental scenarios, we attempted to select a wide variety of datasets for the analysis. The datasets were chosen based on several criteria: sample size, homogeneity of the samples, or availability of the same subjects (identified by RSID) assayed by both technologies.

Statistical models

The following distributions were fitted to each dataset, either by directly modeling the read counts or by first applying a logarithmic transformation:

Negative binomial (NB) model, as implemented in the edgeR (v3.24.3) Bioconductor package (on read counts);
Zero-inflated negative binomial (ZINB), as implemented in the zinbwave (v1.4.2) Bioconductor package (on read counts);
Truncated Gaussian hurdle model, as implemented in the MAST (v1.8.2) Bioconductor package (on log count);
Zero-inflated Gaussian (ZIG), as implemented in the metagenomeSeq (v1.24.1) Bioconductor package (on log count);
Dirichlet-multinomial (DM), as implemented in the MGLM (v0.2.0) CRAN R package.

Negative binomial (NB)

The edgeR Bioconductor package was used to implement the NB model. In particular, normalization factors were calculated with the Trimmed Mean of M-values (TMM) normalization [50] using the calcNormFactors function; common, trended, and tagwise dispersions were estimated by estimateDisp, and a negative binomial generalized log-linear model was fit to the read counts of each feature, using the glmFit function.

Zero-inflated negative binomial (ZINB)

The zinbwave Bioconductor package was used to implement the ZINB model. We fitted a ZINB distribution using the zinbFit function. As explained in the original paper, the method can account for various known and unknown technical and biological effects [23]. However, to avoid giving unfair advantages to this method, we did not include any latent factor in the model (K = 0). We estimated a common dispersion for all features (common_dispersion = TRUE) and we set the likelihood penalization parameter epsilon to 1e10 (within the recommended set of values [24]).

Truncated Gaussian Hurdle model

We used the implementation of the MAST Bioconductor package. After a log2 transformation of the rescaled counts with a pseudocount of 1, a zero-truncated Gaussian distribution was modeled through generalized regression on positive counts, while a logistic regression modeled feature expression/abundance rate. As suggested in the MAST paper [7], the cell detection rate (CDR), which is computed as the proportion of positive count features for each sample, was added as a covariate in the discrete and continuous model matrices as a normalization factor.
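
The CDR covariate described above is simple to compute from the count table. A minimal sketch, assuming a taxa-by-sample layout; the function name cell_detection_rate is hypothetical and not part of MAST.

```python
def cell_detection_rate(counts):
    """Per-sample fraction of features with a positive count.

    counts: rows are features (taxa), columns are samples.
    """
    n_features = len(counts)
    n_samples = len(counts[0])
    return [
        sum(1 for i in range(n_features) if counts[i][j] > 0) / n_features
        for j in range(n_samples)
    ]

# Hypothetical table with four taxa and two samples.
example_counts = [
    [0, 5],
    [3, 2],
    [0, 0],
    [7, 1],
]
cdr = cell_detection_rate(example_counts)
```

In a metagenomic context, this quantity is the fraction of taxa detected in each sample, so it acts as a per-sample summary of sparsity.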

Zero-inflated Gaussian

The metagenomeSeq Bioconductor package was used to implement a ZIG model for log2 transformed counts with a pseudocount of 1, rescaled by the median of all normalization factors or by 1e03 which gives the interpretation of “count per thousand” to the offsets. The CumNormStat and CumNorm functions were used to perform Cumulative Sum Scaling (CSS) normalization, which accounts for specific data characteristics. Normalization factors were included in the regression through the fitZig function.

Note that both MAST and metagenomeSeq were applied to the normalized, log-transformed data. We evaluated both models, using their default scale factor \( \log_2\left(\frac{\mathrm{counts}\cdot 10^6}{\mathrm{libSize}}+1\right) \) for MAST and \( \log_2\left(\frac{\mathrm{normFacts}}{1000}+1\right) \) for metagenomeSeq, as well as by rescaling the data to the median library size [13], \( \log_2\left(\frac{\mathrm{counts}\cdot \mathrm{median}\left(\mathrm{libSize}\right)}{\mathrm{libSize}}+1\right) \) and \( \log_2\left(\frac{\mathrm{normFacts}}{\mathrm{median}\left(\mathrm{normFacts}\right)}\right) \), respectively.
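
The two rescalings used for MAST can be sketched as a single function that differs only in the scale constant. This is an illustrative sketch under the assumption of a taxa-by-sample table with non-empty samples; log2_rescaled is a hypothetical name.

```python
import math
from statistics import median

def log2_rescaled(counts, scale=None):
    """Compute log2(counts * scale / libSize + 1) for every entry.

    counts: rows are features, columns are samples.
    scale: rescaling constant; when None, the median library size is used.
    Assumes every sample has a library size greater than zero.
    """
    n_samples = len(counts[0])
    lib_size = [sum(row[j] for row in counts) for j in range(n_samples)]
    if scale is None:
        scale = median(lib_size)
    return [
        [math.log2(row[j] * scale / lib_size[j] + 1) for j in range(n_samples)]
        for row in counts
    ]

# Hypothetical table: library sizes are 100 and 200, so the median is 150.
toy_counts = [
    [10, 0],
    [30, 20],
    [60, 180],
]
default_scaled = log2_rescaled(toy_counts, scale=1e6)  # counts per million
median_scaled = log2_rescaled(toy_counts)              # median library size
```

Zero counts map to zero under both scalings, while the positive values shift by roughly log2 of the ratio between the two scale constants. The default scale therefore moves the positive mode further from zero than the median rescaling does.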

Dirichlet-Multinomial

The MGLM package was used to fit a Dirichlet-Multinomial regression model for counts. The MGLMreg function with dist = “DM” allowed the implementation of the above model and the estimation of the parameter values.

Goodness of fit (GOF)

To evaluate the goodness of fit of the models, we computed the mean differences between the estimated and observed values for several datasets.

For each model, we evaluated two distinct aspects: its ability to correctly estimate the mean counts (plotted in logarithmic scale with a pseudo-count of 1) and its ability to correctly estimate the probability of observing a zero, computed as the difference between the probability of observing a zero count according to the model and the observed zero frequencies (zero probability difference, ZPD). We summarized the results by computing the root mean squared error (RMSE) of the two estimators. The lower the RMSE, the better the fit of the model.
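
The two summaries described above can be sketched for a negative binomial fit as follows. This sketch makes several assumptions that are not stated in the text: each feature has a single fitted mean mu and dispersion phi (an intercept-only model), the NB variance is mu + phi * mu^2, and the mean difference is taken on the natural log scale. All function names are hypothetical.

```python
import math

def nb_zero_probability(mu, phi):
    """P(Y = 0) under a negative binomial with mean mu and dispersion phi."""
    if phi == 0:
        return math.exp(-mu)  # Poisson limit as the dispersion goes to zero
    return (1.0 + phi * mu) ** (-1.0 / phi)

def rmse(differences):
    """Root mean squared error of a list of differences."""
    return math.sqrt(sum(d * d for d in differences) / len(differences))

def goodness_of_fit(counts, fitted_mu, fitted_phi):
    """Return (RMSE of mean differences, RMSE of zero probability differences).

    counts: rows are features, columns are samples.
    fitted_mu, fitted_phi: one fitted value per feature (assumption).
    """
    mean_diffs, zero_prob_diffs = [], []
    for row, mu, phi in zip(counts, fitted_mu, fitted_phi):
        observed_mean = sum(row) / len(row)
        observed_zero_freq = sum(1 for y in row if y == 0) / len(row)
        # Estimated minus observed mean, on the log scale with a pseudo-count of 1.
        mean_diffs.append(math.log(mu + 1) - math.log(observed_mean + 1))
        # Model-based zero probability minus observed zero frequency (ZPD).
        zero_prob_diffs.append(nb_zero_probability(mu, phi) - observed_zero_freq)
    return rmse(mean_diffs), rmse(zero_prob_diffs)
```

A positive ZPD for a feature means that the model predicts more zeros than were observed, and a negative ZPD means that it predicts fewer.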

This analysis was repeated for 100 datasets available in HMP16SData and curatedMetagenomicData (Additional file 1: Supplementary Table S1 and Supplementary Fig. S2).

Assuming homogeneity between samples inside the same body subsite or study condition, we specified a model consisting of only an intercept or including a normalization covariate.

Differential abundance detection methods

DESeq2

The DESeq2 (v1.22.2) Bioconductor package fits a negative binomial model for count data. The default DESeq2 normalization is the so-called Relative Log Expression (RLE), based on scaling each sample by the median ratio of the sample counts over the geometric mean counts across samples. As 16S and WMS data sparsity may lead to a geometric mean of zero, it is replaced by the nth root of the product of the non-zero counts (which is the geometric mean of the positive count values) as proposed in the phyloseq package [51] and implemented in the DESeq2 estimateSizeFactors function with option type = “poscounts”. We also tested DESeq2 with TMM normalization (see below). As proposed in [24], observational weights were supplied in the weights slot of the DESeqDataSet class object to account for zero inflation. Observational weights were computed by the computeObservationalWeights function of the zinbwave package. To test for DA, we used a likelihood-ratio test (LRT) to compare the reduced model (intercept only) to the full model with intercept and group variable. The p values were adjusted for multiple testing via the Benjamini-Hochberg (BH) procedure. Some p values were set to NA via the cooksCutoff argument, which prevents rare or outlier features from being tested.
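
The Benjamini-Hochberg adjustment applied here, and in most of the methods described in this section, can be sketched in a few lines. This is a minimal illustration of the standard step-up procedure, not the code used by DESeq2. It assumes that missing p values (NA) have already been removed.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p values in the same order as the input."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p values for four taxa.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)
```

Because the adjustment depends on the number of tests m, features that are excluded from testing (for example, those with NA p values) change the adjusted values of the remaining features.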

edgeR

The edgeR Bioconductor package fits a negative binomial distribution, similarly to DESeq2. The two approaches differ mainly in the normalization, dispersion parameter estimation, and default statistical test. We examined different procedures by varying the normalization and the dispersion parameter estimation: edgeR_TMM_standard involves TMM normalization and tagwise dispersion estimation through the calcNormFactors and estimateDisp functions, respectively (with default values). Analogously to DESeq2, “poscounts” normalization was used in addition to TMM in edgeR_poscounts_standard to investigate the normalization impact. We also evaluated the impact of employing a robust dispersion estimation, accompanied with a quasi-likelihood F test through the estimateGLMRobustDisp and glmQLFit functions respectively (edgeR_TMM_robustDisp). As with DESeq2, zinbwave observational weights were included in the weights slot of the DGEList object in edgeR_TMM_zinbwave to account for zero inflation, through a weighted F test. Benjamini-Hochberg correction was used to adjust p values for multiple testing.

Limma-voom

The limma Bioconductor package (v3.38.3) includes a voom function that (i) transforms previously normalized counts to logCPM, (ii) estimates a mean-variance relationship, and (iii) uses this to compute appropriate observational-level weights [21]. To adapt the limma-voom framework to zero-inflation, zinbwave weights have been multiplied by voom weights as done previously [24]. The residual degrees of freedom of the linear model were adjusted before the empirical Bayes variance shrinkage and were propagated to the moderated statistical tests. Benjamini-Hochberg correction method was used to correct p values.

ALDEx2

ALDEx2 is a Bioconductor package (v1.14.1) that uses a Dirichlet-multinomial model to infer abundance from counts [14]. The aldex method infers biological and sampling variation to calculate the expected false discovery rate, given the variation, based on several tests. Technical variation within each sample is estimated using Monte-Carlo draws from the Dirichlet distribution. This distribution maintains the proportional nature of the data, while scale invariance and sub-compositional coherence of the data are ensured by the centered log-ratio (CLR) transformation. This removes the need for a between-sample normalization step. In order to obtain symmetric CLRs, the iqlr argument is applied, which takes, as the denominator of the log-ratio, the geometric mean of those features whose CLR variance lies between the first and the third quartile. Statistical testing is done through the Wilcoxon rank sum test, although Welch’s t, Kruskal-Wallis, generalized linear model, and correlation tests are also available. Benjamini-Hochberg correction method was used to correct the p values for multiple testing.

metagenomeSeq

metagenomeSeq is a Bioconductor package designed to address the effects of both normalization and under-sampling of microbial communities on disease association detection and testing feature correlations. The underlying statistical distribution for log₂(count + 1) is assumed to be a zero-inflated Gaussian mixture model. The mixture parameter is modeled through a logistic regression depending on library sizes, while the Gaussian part of the model is a generalized linear model with a sample-specific intercept, which represents the sample baseline, a sample-specific offset computed by Cumulative Sum Scaling (CSS) normalization, and another parameter, which represents the experimental group of the sample. We opted for the implementation suggested in the original publication [13], where CSS scaling factors are divided by the median of all the scaling factors instead of dividing them by 1000 (as done in the Bioconductor package). An EM algorithm is performed by the fitZig function to estimate all the parameters. An empirical Bayes approach is used for variance estimation and a moderated t test is performed to identify differentially abundant features between conditions. Benjamini-Hochberg correction method was used to account for multiple testing.

Corncob

corncob is an R package (v0.1.0 [52]) for the differential abundance and differential variability analysis of microbiome data [17]. Specifically, corncob is designed to account for the challenges of modeling sequencing data from microbial abundance studies. It is based on a hierarchical model in which the latent relative abundance of each taxon is modeled as a beta distribution, and the observed absolute presence of a taxon is modeled as a binomial process with the previously specified beta as the probability of success. This hierarchical structure gives flexibility to the method, which can account for changes in the average count values as well as their dispersion. A generalized linear model framework, with a logit link function, is used to allow the study of covariates in the feature count distributions. The model fit is performed by maximum likelihood using the trust region optimization algorithm [17]. Likelihood-ratio or Wald tests can be used to test the null hypothesis of no DA.
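
The hierarchical model described above corresponds to a beta-binomial distribution for the count of a taxon given the sequencing depth. The sketch below evaluates its probability mass function. It assumes a mean and overdispersion parameterization (mu, phi), with beta shape parameters a = mu(1 - phi)/phi and b = (1 - mu)(1 - phi)/phi. The function names are hypothetical, and the code is not taken from corncob.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(w, depth, mu, phi):
    """P(W = w) when W | q ~ Binomial(depth, q) and q ~ Beta(a, b).

    mu: expected relative abundance, in (0, 1).
    phi: overdispersion, in (0, 1); larger values give more variable counts.
    """
    a = mu * (1.0 - phi) / phi
    b = (1.0 - mu) * (1.0 - phi) / phi
    log_choose = (math.lgamma(depth + 1) - math.lgamma(w + 1)
                  - math.lgamma(depth - w + 1))
    return math.exp(log_choose + log_beta(w + a, depth - w + b) - log_beta(a, b))

# Hypothetical values: depth of 50 reads, 10% relative abundance.
depth, mu, phi = 50, 0.1, 0.05
pmf = [beta_binomial_pmf(w, depth, mu, phi) for w in range(depth + 1)]
mean_count = sum(w * p for w, p in enumerate(pmf))
var_count = sum((w - mean_count) ** 2 * p for w, p in enumerate(pmf))
```

Under this parameterization the mean count is depth * mu, and the variance exceeds the binomial variance by a factor of 1 + (depth - 1) * phi. This is what allows both the abundance and its variability to be modeled.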

Songbird

songbird is a python package [53] that ranks microbes that are changing the most relative to each other [16]. The method is based on a compositional approach in which the underlying count distribution is assumed to be multinomial. The coefficients from multinomial regression can be ranked to determine which taxa are changing the most between samples. The compositionality is addressed using the differential abundance of each taxon as reference to each other when they are ranked numerically. Since songbird has been developed as an extension tool for Qiime2, we converted all our data tables to the .biom format to serve as input for this method. The authors’ suggested analysis pipeline requires several manual adjustments to the tuning parameters on the basis of the comparison of the results after several runs, making it difficult to implement this method within a benchmarking framework. For this reason, we used the default values for all the tuning parameters.

mixMC

mixMC is a multivariate framework implemented in mixOmics, a Bioconductor package (v6.6.1), for omic data analysis [18]. It handles compositional and sparse data, repeated-measures experiments, and multiclass problems. After the addition of a pseudo-count value of 1, the TSS normalization is applied to the count table and the CLR transformation is performed to account for compositionality. The method is based on a Partial Least Squares (PLS) Discriminant Analysis (DA), a multivariate regression model which maximizes the covariance between linear combinations of the feature counts and the outcome (in our case, a dummy variable indicating the body site/group of each sample). Covariance maximization is achieved in a sequential manner via the use of latent component scores [18]. Each component is a linear combination of the feature counts and characterizes a source of covariation between the feature and the groups. The sparse version of PLS-DA, sPLS-DA uses Lasso penalizations to select the most discriminative features in the PLS-DA model. The penalization is applied component-wise and the resulting selected features reflect the particular source of covariance in the data highlighted by each PLS component. We specified the number of features to select per component at 100 or more, and we optimized it using leave-one-out cross-validation. Since we always compared two groups in this manuscript, only the first component is necessary for the analysis. The multivariate regression coefficients, one for each feature, were ranked in order to obtain the most discriminant features for the first component.
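
The preprocessing applied before sPLS-DA (pseudo-count, TSS normalization, and CLR transformation) can be sketched for a single sample as follows. This is an illustrative sketch, not the mixOmics code; tss_clr is a hypothetical name.

```python
import math

def tss_clr(sample_counts, pseudocount=1):
    """Pseudo-count, total sum scaling, then centered log-ratio for one sample."""
    shifted = [c + pseudocount for c in sample_counts]
    total = sum(shifted)
    proportions = [c / total for c in shifted]        # TSS: values sum to 1
    log_props = [math.log(p) for p in proportions]
    mean_log = sum(log_props) / len(log_props)        # log of the geometric mean
    return [lp - mean_log for lp in log_props]        # CLR: values sum to 0

# Hypothetical sample with three taxa.
clr_values = tss_clr([0, 1, 3])
```

The CLR values of a sample sum to zero by construction. Because the CLR is invariant to rescaling of the input, the TSS step does not change the result, so the only step that affects the transformed values is the pseudo-count.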

MAST

MAST is a Bioconductor package for managing and analyzing qPCR and sequencing-based single-cell gene expression data, as well as data from other types of single-cell assays. The package also provides functionality for significance testing of differential expression using a Hurdle model. Zero rate represents the discrete part, modeled as a binomial distribution, while \( \log_2\left(\frac{\mathrm{counts}_{i,j}\cdot \mathrm{median}\left(\mathrm{libSize}\right)}{\mathrm{libSize}_j}+1\right) \), where i and j represent the ith feature and the jth sample, respectively, is used for the continuous part, modeled as a Gaussian distribution. The kind of data considered, different from scRNA-seq, does not allow the usage of the adaptive thresholding procedure suggested in the original publication [7]. Indeed, because of the amount of feature loss, if adaptive thresholding is applied, the comparison of MAST with other methods would be unfair. However, a normalization variable is included in the model. This variable captures information about each feature sparsity related to all the others; hence, it helps to yield more interpretable results and decreases background correlation between features. The function zlm fits the Hurdle model for each feature: the regression coefficients of the discrete component are regularized using a Bayesian approach as implemented in the bayesglm function; regularization of the continuous model variance parameter helps to increase the robustness of feature-level differential expression analysis when a feature is only present in a few samples. Because the discrete and continuous parts are defined as conditionally independent for each feature, tests with asymptotic χ² null distributions, such as the likelihood-ratio or Wald tests, can be summed and remain asymptotically χ², with the degrees of freedom of the component tests added. Benjamini-Hochberg correction method was used to correct p values.
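
The combination of the discrete and continuous test statistics described above can be sketched as follows. This is an illustration and not the MAST implementation. It assumes that each component contributes one degree of freedom (a two-group comparison), and the χ² tail probability is computed with a closed form that is valid only for an even number of degrees of freedom. The function names are hypothetical.

```python
import math

def chi2_sf_even_df(statistic, df):
    """Upper tail probability of a chi-square distribution with even df.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only supports positive, even df")
    half = statistic / 2.0
    term, total = 1.0, 1.0  # i = 0 term of the sum
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def combine_hurdle_tests(lrt_discrete, lrt_continuous,
                         df_discrete=1, df_continuous=1):
    """Sum the two likelihood-ratio statistics and their degrees of freedom."""
    statistic = lrt_discrete + lrt_continuous
    df = df_discrete + df_continuous
    return statistic, df, chi2_sf_even_df(statistic, df)

# Hypothetical statistics for one taxon.
combined_stat, combined_df, combined_p = combine_hurdle_tests(2.0, 4.0)
```

A taxon can therefore be called DA because of a change in its detection rate, a change in its abundance when detected, or a combination of the two.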

Seurat with Wilcoxon rank sum test

The Seurat (v2.3.4) R package is a data analysis toolkit for the analysis of single-cell RNA-seq [22]. Briefly, counts were scaled, centered, and LogNormalized. The Wilcoxon rank sum test for detecting differentially abundant features was performed via the FindMarkers function. Rare features, which are present in a fraction lower than 0.1 of all samples, and weak signal features, which have a log fold change between conditions lower than 0.25, are not tested. Benjamini-Hochberg correction method was used to correct p values.

SCDE—single-cell differential expression

The scde Bioconductor package (v1.99.1), used with the flexmix package (v2.3-13), implements a Bayesian model for scRNA-seq data [8]. Read counts observed for each gene are modeled using a mixture of a negative binomial (NB) distribution (for the amplified/detected transcripts) and a low-level Poisson distribution (for the unobserved or background-level signal of genes that failed to amplify or were not detected for other reasons). The scde.error.models function was used to fit the error models on which all subsequent calculations rely. The fitting process is based on a subset of robust genes detected in multiple cross-cell comparisons. Error models for each group of cells were fitted independently (using two different sets of “robust” genes). Translated to a metagenomic context, cells correspond to samples and genes to taxa or amplicon sequence variants. Some function defaults needed to be adjusted, such as the minimum number of features to use when determining the expected abundance magnitude during model fitting. This option, defined by the min.size.entries argument and set by default to 2000, was too large for many 16S or WMS experiment scenarios: as we usually observe around 1000 total features per dataset (after filtering out rare ones), we replaced 2000 with 20% of the total number of features, obtaining a dataset-specific value. Particularly poor samples may result in abnormal fits and were removed, as suggested in the scde manual. To test for differential expression between the two groups of samples, a Bayesian approach was used: incorporating evidence provided by the measurements of individual samples, the posterior probability of a feature being present at any given average level in each subpopulation was estimated. To moderate the impact of high-magnitude outlier events, bootstrap resampling was used and the posterior probability of abundance fold-change between groups was computed.